Institutional Archive of the Naval Postgraduate School 





Calhoun: The NPS Institutional Archive 
DSpace Repository 


Theses and Dissertations 1. Thesis and Dissertation Collection, all items 


1986 


The Synergistically Integrated Reliability 
architecture: a reliability analysis of an 
ultra-reliable fault tolerant computer design. 


Nelson, Ronald J. 


http://hdl.handle.net/10945/22139


Downloaded from NPS Archive: Calhoun 


Calhoun is the Naval Postgraduate School's public access digital repository for
research materials and institutional publications created by the NPS community.
Calhoun is named for Professor of Mathematics Guy K. Calhoun, NPS's first
appointed -- and published -- scholarly author.





Dudley Knox Library / Naval Postgraduate School
411 Dyer Road / 1 University Circle
Monterey, California USA 93943

http://www.nps.edu/library






















NAVAL POSTGRADUATE SCHOOL
Monterey, California


THESIS


THE SYNERGISTICALLY INTEGRATED RELIABILITY
ARCHITECTURE: A RELIABILITY
ANALYSIS OF AN ULTRA-RELIABLE
FAULT TOLERANT COMPUTER DESIGN


by 


Ronald J. Nelson 
September 1986 


Thesis Advisor: L. Abbott


Approved for public release; distribution is unlimited.







SECURITY CLASSIFICATION OF THIS PAGE

REPORT DOCUMENTATION PAGE

1a. REPORT SECURITY CLASSIFICATION: UNCLASSIFIED
1b. RESTRICTIVE MARKINGS:
2a. SECURITY CLASSIFICATION AUTHORITY:
2b. DECLASSIFICATION / DOWNGRADING SCHEDULE:
3.  DISTRIBUTION / AVAILABILITY OF REPORT: Approved for public release;
    distribution is unlimited.
4.  PERFORMING ORGANIZATION REPORT NUMBER(S):
5.  MONITORING ORGANIZATION REPORT NUMBER(S):
6a. NAME OF PERFORMING ORGANIZATION: Naval Postgraduate School
6b. OFFICE SYMBOL (if applicable): 62
6c. ADDRESS (City, State, and ZIP Code): Monterey, California 93943-5000
7a. NAME OF MONITORING ORGANIZATION: Naval Postgraduate School
7b. ADDRESS (City, State, and ZIP Code): Monterey, California 93943-5000
8a. NAME OF FUNDING / SPONSORING ORGANIZATION:
8b. OFFICE SYMBOL (if applicable):
8c. ADDRESS (City, State, and ZIP Code):
9.  PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER:
10. SOURCE OF FUNDING NUMBERS: PROGRAM ELEMENT NO. / PROJECT NO. / TASK NO. /
    WORK UNIT ACCESSION NO.
11. TITLE (Include Security Classification): THE SYNERGISTICALLY INTEGRATED
    RELIABILITY ARCHITECTURE: A RELIABILITY ANALYSIS OF AN ULTRA-RELIABLE
    FAULT TOLERANT COMPUTER DESIGN
12. PERSONAL AUTHOR(S): Ronald J. Nelson
13a. TYPE OF REPORT: Master's Thesis
13b. TIME COVERED: FROM      TO
14. DATE OF REPORT (Year, Month, Day): 1986 September 26
15. PAGE COUNT:
16. SUPPLEMENTARY NOTATION:
17. COSATI CODES: FIELD / GROUP / SUB-GROUP
18. SUBJECT TERMS (Continue on reverse if necessary and identify by block
    number): Fault Tolerance; N-Version Programming
19. ABSTRACT (Continue on reverse if necessary and identify by block number):

    This thesis develops a Semi-Markov reliability model for the
    Synergistically Integrated Reliability (SIR) computer architecture. The
    SIR architecture is an advanced hybrid redundancy scheme that combines
    several current reliability techniques to achieve hardware and software
    reliability. These methods include hybrid redundancy, N-Version
    programming and source congruent data interchange. The architecture is
    designed to support active control systems in the aircraft avionics
    industry as well as the bus controller requirements for the Dispersed
    Sensor Processor Mesh (DSPM) system for ultra-reliable computer
    communications. The paper also develops high level algorithms for fault
    detection, location, and configuration management within the SIR system.

    The reliability model integrates the hardware design, the hybrid
    redundancy philosophy, and the operating constraints of an active control
    system into a single reliability model. Specific models are developed for
    the 3, 4, and 5 processor cases of the SIR architecture and plots of the
    system reliability vs mission time are generated using the SURE
    Reliability Analysis Program.

20. DISTRIBUTION / AVAILABILITY OF ABSTRACT: UNCLASSIFIED/UNLIMITED /
    SAME AS RPT. / DTIC USERS
21. ABSTRACT SECURITY CLASSIFICATION: UNCLASSIFIED
22a. NAME OF RESPONSIBLE INDIVIDUAL: Prof Larry Abbott
22b. TELEPHONE (Include Area Code):
22c. OFFICE SYMBOL:

DD FORM 1473, 84 MAR. 83 APR edition may be used until exhausted.
All other editions are obsolete.
SECURITY CLASSIFICATION OF THIS PAGE


Approved for Public Release; Distribution is Unlimited 
The Synergistically Integrated Reliability Architecture: 


A Reliability Analysis of an Ultra-Reliable 
Fault Tolerant Computer Design 


by 
Ronald J. Nelson 
Captain, United States Army 
B.S.E.E., Virginia Polytechnic and State University, 1977


submitted in partial fulfillment of the 
requirements for the degree of 


MASTER OF SCIENCE IN ELECTRICAL ENGINEERING 
from the 


NAVAL POSTGRADUATE SCHOOL 
September 1986 


ABSTRACT 


This thesis develops a Semi-Markov reliability model for the
Synergistically Integrated Reliability (SIR) computer architecture. The SIR
architecture is an advanced hybrid redundancy scheme that combines several
current reliability techniques to achieve hardware and software reliability.
These methods include hybrid redundancy, N-Version programming and
source congruent data interchange. The architecture is designed to support
active control systems in the aircraft avionics industry as well as the bus
controller requirements for the Dispersed Sensor Processor Mesh (DSPM)
system for ultra-reliable computer communications. The paper also
develops high level algorithms for fault detection, location, and
configuration management within the SIR system.

The reliability model integrates the hardware design, the hybrid
redundancy philosophy, and the operating constraints of an active control
system into a single reliability model. Specific models are developed for the
3, 4, and 5 processor cases of the SIR architecture and plots of the system
reliability vs mission time are generated using the SURE Reliability
Analysis Program.


I. INTRODUCTION


The role that computers play in controlling complex systems has 
increased dramatically with the advent of low cost microcomputers.
Research into fault tolerant computing has also intensified due to the 
increased cost benefits available when microprocessors are used as 
redundant system elements. 

The combination of fault tolerance and complex system control in a 
microprocessor based system has made it possible to create cost effective, 
real time control systems for use in systems in which a failure could have 
life threatening results. Real time control systems allow design of complex 
systems at, or near, instability points. Designs of this type offer important 
economic and performance gains, but there is little tolerance available to 
account for fluctuations in the operational environment. 

Advanced avionics is a category of applications where these concepts
can be used. Advanced avionics encompasses the application of real time
active control technology to govern a variety of in-flight maneuvers that
are designed to enhance cost and performance measures of airplanes.

An example using these concepts is the digital fly-by-wire (DFBW) program
being researched at the NASA Ames Research Center, Dryden Flight
Research Facility. This program uses an F-8 aircraft that is modified to use
active controls so that the flight of the airplane can be controlled by a
digital computer. The airframe's flight status is updated every 20 ms by a
set of sensors with the information being supplied to a computer. The
computer analyzes the flight data and instructs a set of servo mechanisms
to modify the flight pattern to conform to a given set of flight laws. The
airframe is designed to be statically stable, so the maximum bound on the
length of a control cycle is on the order of 200 ms. The ability of the flight
laws to handle possible flight situations is suspect if the upper bound of
the control cycle is exceeded. [Refs. 1,2,3]

Another program that is being studied at the Dryden Research Facility
uses an X-29 airframe modified to reduce the static stability margin
required to fly the aircraft effectively. The avionics control package
designed to provide adequate flight control of the aircraft must react
within a control cycle that is on the order of 20 to 30 ms in duration.
Uncontrolled flight has the possibility of producing oscillatory airframe
behavior with the amplitude of the oscillation doubling every 190 ms.
Failure to control instability of this type results in the breakup of the
aircraft within a very small time period. [Refs. 1,2,3]
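
To give a sense of scale for these timing figures, the following minimal sketch estimates how much an uncontrolled oscillation grows when control updates are missed. It assumes pure exponential growth with a fixed doubling time, which is a simplification of real airframe dynamics, and the function name `amplitude_growth` is hypothetical rather than taken from the thesis.

```python
def amplitude_growth(elapsed_ms: float, doubling_ms: float = 190.0) -> float:
    """Return the factor by which oscillation amplitude grows.

    Assumption: amplitude doubles every `doubling_ms` milliseconds of
    uncontrolled flight (simple exponential growth, no damping).
    """
    if doubling_ms <= 0:
        raise ValueError("doubling_ms must be positive")
    if elapsed_ms < 0:
        raise ValueError("elapsed_ms must be non-negative")
    return 2.0 ** (elapsed_ms / doubling_ms)


def growth_after_missed_cycles(missed_cycles: int, cycle_ms: float = 30.0,
                               doubling_ms: float = 190.0) -> float:
    """Growth factor after a number of consecutive missed control cycles."""
    return amplitude_growth(missed_cycles * cycle_ms, doubling_ms)
```

Under these assumptions, one missed 30 ms cycle lets the amplitude grow by roughly 12 percent, while a full doubling period without control doubles it, which illustrates why the control computer cannot be unavailable for long.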

The benefits of design with close environment tolerances are very real
for the aircraft industry. Active control technology applied to the avionics
industry is a field that uses control systems to supply inputs to the
effector mechanisms that control the behavior of the aircraft
(independent of specific pilot direction). Estimates of the performance
increases made possible by using active controls in statically unstable
designs vary with the design choices made and the area where enhancement
is desired. Boeing Aircraft, in a study for the U.S. Air Force, concluded that
[...] possible. Boeing also projected a possible [...]
increase in payload or a [...]% increase in aircraft range for a [...] aircraft
[Ref. 2, p. 15]. These figures certainly indicate that implementation of
active controls is desirable if the control system can reliably maintain a
reasonable margin of operational safety. [Ref. 1]

Aircraft safety is an intensively regulated endeavor. The Federal
Aviation Administration (FAA) currently requires reliability figures on
aircraft in the range of 10^-9 catastrophic failures per flight hour for
flights with duration of up to 10 hours. Obviously, this reliability
requirement would be applied to aircraft designed to the above
specifications.

Achieving a system reliability of this magnitude is no trivial task. The
system reliability depends on more than the computer itself. The reliability
of a series of components degrades as the product of the component
reliabilities, even using identical components, as is shown in Figure 1-1. In
order to meet the FAA requirements for system reliability, the system
components must all be ultrareliable. The components of such a system are
depicted in Figure 1-2.
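
The product rule for components in series can be illustrated with a short sketch. It is not taken from the thesis: the function names are hypothetical, the components are assumed to fail independently, and the conversion from a failure rate to a mission failure probability assumes a constant (exponential) failure rate. The value of 10^-9 failures per flight hour used in the usage note is an assumed example target.

```python
import math
from typing import Iterable


def series_reliability(reliabilities: Iterable[float]) -> float:
    """Reliability of independent components in series (all must work)."""
    return math.prod(reliabilities)


def required_component_reliability(system_target: float, n: int) -> float:
    """Reliability each of n identical series components needs so that
    their product meets `system_target`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return system_target ** (1.0 / n)


def mission_failure_probability(rate_per_hour: float, hours: float) -> float:
    """Probability of at least one failure during the mission.

    Assumption: constant failure rate, so R(t) = exp(-rate * t).
    expm1 keeps precision when rate * hours is very small.
    """
    return -math.expm1(-rate_per_hour * hours)
```

Three components with a reliability of 0.9 each give a series reliability of only 0.729. At an assumed rate of 10^-9 failures per hour, a 10 hour flight has a failure probability of about 10^-8, and every element in the series chain must do better than that for the system as a whole to meet it.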
A great deal of research on this method of achieving system reliability
has already been accomplished. Of concern to this thesis is work conducted
at NASA's Dryden Flight Research Facility on the Dispersed Sensor Processor
Mesh (DSPM) [Refs. 1,4,5] and current research at the Naval Postgraduate
School (NPS) on fault tolerant computers [Refs. 6,7]. The DSPM is an ultra
reliable communications network that is to be used in connecting ultra
reliable sensor/effector sets with an ultra reliable computer. The DSPM,
based on work by Smith [Ref. 8], is an external communications network
that monitors and controls the buses providing data to and from the ultra
reliable computer and the sensor/effector sets [Refs. 1,4,5].

The current research being conducted by Dr Abbott at NPS concerns itself
with the computer portion of the ultra reliable system. A proposed
architecture to satisfy the high reliability requirements is the
Synergistically Integrated Reliability (SIR) architecture [Ref. 6]. A
hardware design implementing this architecture was recently created at
NPS by Captain Virgil Spurlock, US Army [Ref. 7].

The SIR architecture is an advanced hybrid redundancy scheme that
combines several of the most current reliability techniques to achieve
hardware and software reliability. These methods include hybrid
redundancy, N-version programming, and source congruent data interchanges
[Refs. 6,9]. The basic architecture is displayed in Figure 1-3. Note that
there is no single component that directs the actions of the redundant set of
processors. Each active processor is totally independent, makes its own
decisions on the correct course of action, and controls some portion of the
external sensor/effector set.


Figure 1-3 SIR Architecture (figure label: one port to the DSPM network)





N-version programming, source congruency algorithms for data supplied
to the SIR system, and the lack of a system control point force added
complexities in both the hardware implementation and the software that
controls task flow in the system. A more thorough discussion of the
implications of these techniques for the architecture implementation will
be made in Chapters II and V.

The purpose of the SIR architecture is to tolerate a number of faults
while still providing results that are judged to be correct within a defined
range of confidence. The triple modular redundancy (TMR) model, on which
hybrid redundancy is based, specifically judges that a minimum acceptable
confidence level can only be obtained when at least two communicating
components (processors) agree on the result of a test applied to system
data values. The confidence level of correct system operation increases as
more components can be used in the verification process. Ideally, a
configuration control algorithm will modify the system configuration to one
that provides the optimum confidence in system output given the occurrence
of one, or a series, of specific faults.
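
The two-out-of-three agreement requirement can be illustrated with a minimal
Python sketch. It assumes, for illustration only, that the three components
are identical, that they fail independently, and that the comparison
mechanism itself never fails. The function name and the sample reliability
value are hypothetical and are not taken from the SIR design.

```python
def two_of_three_reliability(r: float) -> float:
    """Probability that at least 2 of 3 independent components, each with
    reliability r, are working (assumes a perfect comparison mechanism)."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability between 0 and 1")
    # all three work (r^3) plus exactly two work (3 * r^2 * (1 - r))
    return r**3 + 3 * r**2 * (1 - r)


if __name__ == "__main__":
    # Hypothetical component reliability, chosen only for illustration.
    print(two_of_three_reliability(0.9))
```

Under these assumptions the triad is more reliable than a single component
only when the component reliability exceeds 0.5, which is one reason the
individual component reliabilities matter so much to the system model.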

Modeling the reliability of such a system, or the Mean Time To Failure
(MTTF) for the system taken as a whole, is of course very dependent on the
reliability of the components of the system as well as the ability of the
system to correctly identify both the occurrence of a fault and its precise
location within the present system configuration. The system reliability is
thus directly dependent not only on the component hardware reliabilities,
but also on an operating protocol that includes a system of fault tests and a
system of reconfiguration algorithms. The configuration control algorithm
is one of a number of algorithms that will detect, locate, and configure
around system faults. The group of algorithms, taken as a whole, will be
referred to as operating protocols throughout the remainder of this paper.

Additional complications are created by the real time nature of the
control problem that is to be solved by the SIR computer system. All
decisions on correct data values must be made within the context of a 20 to
30 ms control cycle. This implies that any fault detection tests or system
configuration management tasks must also conform to this stringent
operational cycle.

The set of system configurations consisting of all possible combinations
of system components in either good or faulty states is referred to as the
set of system states. The size of the set of system states is $2^n$, where
there are n components in the system, each with two states, good or faulty.
The subset of all system states that must be controlled by the redundancy
management software must be small enough to allow operation within the
control window.
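
The growth of the state set can be made concrete with a short Python sketch.
It simply enumerates every good/faulty combination for n two-state
components. The choice of n = 6 in the usage note is an assumption made for
illustration (six nodes, each treated as a single component); the function
names are hypothetical.

```python
from itertools import product


def system_states(n: int):
    """Enumerate every good/faulty combination for n two-state components.
    True marks a good component, False a faulty one."""
    return list(product((True, False), repeat=n))


def states_with_at_least(n: int, k: int) -> int:
    """Count the states in which at least k of the n components are good."""
    return sum(1 for state in system_states(n) if sum(state) >= k)
```

For example, `len(system_states(6))` gives 64 states, of which
`states_with_at_least(6, 2)` gives 57 that still contain two or more good
components. Every component added to the model doubles the size of the set.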

This thesis is concerned with developing a reliability model for the SIR
architecture that is appropriate to the environment in which it is to
operate. A high level specification will be generated for the algorithms
that perform fault detection, location, and configuration management
tasks. A secondary goal of the thesis is to analyze the model with a
semi-Markov analysis tool that was developed at NASA Langley by Butler,
based on work by White and Lee [Refs. 10,11,12]. A modification of this
program to provide a graphical, event driven user interface and allow it to
operate on an IBM PC-AT microcomputer is being developed as a thesis at
NPS by Major John Bordeaux, USMC. The reliability model generated by this
thesis will serve as a test vehicle for the program conversion.


II. RELIABILITY ANALYSIS OF THE SIR NODE DESIGN


The hardware of the individual SIR computer nodes must satisfy two
equally important design constraints. The nodes must be as simple as
possible while still meeting the computational requirements of the problem
to be solved. The nodes must also operate with sufficient speed to complete
the requisite calculations as well as any algorithms that detect
errors and determine the correct course of action subsequent to error
detection. An additional constraint on the node design is the requirement for
the hardware to support N-version programming.

The SIR architecture is based on a variation of basic hybrid redundancy.
Basic hybrid redundancy, shown in Figure 2-1, is an organizational scheme
proposed by Siewiorek [Ref. 13] that achieves increased reliability by using
redundant processors and a voting procedure to decide on a correct answer.

Basic hybrid redundancy utilizes three on-line computer nodes to
determine the correct system output value, with the remaining computers
being either spare or failed. The rotary multiplexer controls which of the
five computers are connected to the voter. The voter performs a bit by bit
comparison of the three data streams from the active computer nodes. The
correct result is sent to the external interface. The voter rejects any
active node value that does not match the other two on-line nodes. A status
of the vote is returned to the rotary multiplexer for use in selecting which
three processors from the total set will be active.
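
The bit by bit vote described above can be sketched in Python as a bitwise
majority function over three words. This is a software illustration of the
voting idea only, not the hardware voter of the basic hybrid redundancy
scheme; the function names are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bit by bit majority of three words: each output bit takes the value
    held by at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)


def disagreeing_inputs(a: int, b: int, c: int):
    """Indices (0, 1, 2) of the inputs that differ from the voted word."""
    voted = majority_vote(a, b, c)
    return [i for i, word in enumerate((a, b, c)) if word != voted]
```

A call such as `disagreeing_inputs(0b1010, 0b1010, 0b0110)` identifies the
third input as the one to reject, which is the kind of status a multiplexer
could use when choosing a replacement node.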


Figure 2-1 Basic Hybrid Redundancy


The N-version concept in software reliability was proposed by Chen and
Avizienis [Ref. 14] and is similar to the basic hybrid redundancy technique
for hardware. In this case, multiple versions of a computational function
are written to the same software specification. The versions may be
written in different languages or compiled with different compilers, but the
intended effect is to eliminate a class of software errors that are data
driven. The theory states that it is unlikely that the same data driven
programming error will surface in all of the programs in exactly the same
way. Of course this does not provide proof against design faults, but the
process could be extended to the software specification also.

N-version programming imposes several constraints on the hybrid
redundancy scheme for increased hardware reliability. The different
software versions of the function being implemented will behave
differently with respect to the roundoff and truncation errors that are
inherent in digital computers. This will cause slight variations in the
values produced by these routines. Slight variations are quite acceptable if
they remain within some preset tolerance. If this small expected
difference in values is to be tolerated, then the voter in the hybrid
redundancy scheme cannot be based on an identical match of data values.
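
The difference between an identity match and a tolerance based match can be
shown with a small Python sketch. The tolerance and the sample values are
hypothetical; they stand in for results produced by three different software
versions of the same function.

```python
def agree(x: float, y: float, tolerance: float) -> bool:
    """Two values agree when they differ by no more than the preset tolerance."""
    return abs(x - y) <= tolerance


def agreeing_pairs(values, tolerance: float):
    """Index pairs (i, j), i < j, whose values agree within the tolerance."""
    pairs = []
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if agree(values[i], values[j], tolerance):
                pairs.append((i, j))
    return pairs
```

With the values `[1.000, 1.001, 1.250]` and a tolerance of `0.01`, the first
two results agree even though they are not identical, while an exact
comparison would report all three as different.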

There will also be some small differences in the time when specific
values are made available to the voters of the three active processors. This
variation is due to the differences in the algorithms and the language
efficiencies of the different N-version programs that generate the data that
will be tested by the vote process. Some method, and the hardware to
support it, must be available to synchronize these time skewed values prior
to the vote process.

The basic hybrid redundancy system requires lock step synchronization
and performs a bit by bit identity comparison on the data being voted. These
requirements of basic hybrid redundancy do not support N-version
programming.

Regardless of the redundancy and fault tolerant strategy utilized in a
fault tolerant system, all fault tolerant systems contain sections, called
hardcores, that must work for the system to work. The voter and the rotary
multiplexer form a hardcore for the basic hybrid redundancy scheme. The
hardcore represents a single point of failure that could result in
catastrophic system failure. The voter can vote the wrong computer out; the
rotary multiplexer can select the wrong computer node for the voter.

The SIR architecture differs from basic hybrid redundancy in order to
reduce the hardcore problems mentioned above. The system still relies on a
triad of active computers for detection of error conditions and for deciding
the correct value in the presence of an error. A block diagram depicting the
design of a SIR node is shown in Figure 2-2.

Each of the computer nodes in the SIR system contains a voter and a
rotary multiplexer. This allows a great deal more flexibility in
configuration management. The strategy also removes the voter and
multiplexer from hardcore status. The system can tolerate faults not only
in the host computers, but also in a multiplexer or voter, and still continue
to operate with a high degree of confidence. Recall that the minimum
confidence level requirement of the system is two correctly operating
host computers and a communications link between them.

The design of the voter is another major difference between the basic hybrid
redundancy scheme and the SIR architecture. SIR meets the requirements of
N-version programming and the source congruency algorithms (discussed in
Chapter V) by using a mid value voter. The mid value voter
performs a bit by bit comparison of 3 values and returns those same values
sorted in value order (integer values). The mid value of this data triple is
taken to be the most correct; it is also the value that will be supplied to the
host computer for further processing or communication with SIR’s external
interface (DSPM). In addition, the state of the voter process is given. The
voter status register contains a maximum or minimum indication. If there
is no minimum indication then the two smallest values are equal. If there is
no maximum indication then the two largest values are equal. If neither
indication is given then all three values are equal. The price paid for this
increase in functionality is increased complexity of the voters.
Complexity increases equate to decreases in component reliability, as will
be discussed in section A of this chapter.
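
The behavior of the mid value voter and its status indications can be
sketched in Python. The hardware performs the ordering with a bit by bit
comparison; this sketch uses Python's built in sort instead, and the flag
names are hypothetical.

```python
def mid_value_vote(a: int, b: int, c: int):
    """Return (mid, min_indication, max_indication) for three integer values.

    min_indication is False when the two smallest values are equal;
    max_indication is False when the two largest values are equal;
    both are False when all three values are equal."""
    low, mid, high = sorted((a, b, c))
    min_indication = low < mid
    max_indication = high > mid
    return mid, min_indication, max_indication
```

For the inputs `(5, 3, 4)` the sketch returns the mid value 4 with both
indications set, and for `(7, 7, 7)` it returns 7 with neither set. A host
could then compare the spread between the sorted values against a preset
tolerance rather than requiring an identical match.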

The SIR interstage is designed to control the data exchange process
between the voter and its sources of data. It is a rather complicated affair
because each node is designed to be completely hardware independent of the
remaining SIR nodes. The independent clocks used in the SIR nodes are set
to operate at the same rate, but there will obviously be some skew between
them. To overcome this skew problem, data transfers between nodes
require each node to send both data and a clock signal to the remaining
interconnected nodes in the SIR system.

A multiple clocking scheme is used in the interstage. Shift registers,
controlled by the external node clock signals, are used to interface external
data to the remainder of the interstage, which is controlled by the host clock
signal. A block diagram of the interstage is shown in Figure 2-3.


The externally controlled shift registers (indicated in Figure 2-3 by
primes) are interfaced with the internally controlled shift registers
(unprimed) through a windowing process based on an expected time margin.
The internally controlled registers load the values contained in the
externally controlled registers at the completion of a count performed by
the watchdog timer (WDT). An indication of receipt of a complete data word
is obtained by using modulo 32 counters on the incoming external clock
signals. A bit in the slave status register is set when the proper count
is reached, and the clock pulses being relayed to the primed registers are
terminated. The WDT controls a bit in the slave status register in a like
manner.

The rotary multiplexer in the SIR node performs in much the same
manner as that proposed by Siewiorek. The rotary multiplexer proposed by
Siewiorek implements a particular redundancy management algorithm in
hardware as a portion of the design. The SIR architecture performs
redundancy management within the host computer nodes. The complexity of
the multiplexer is reduced in the SIR concept by performing the connection
decision process in the host processor.

The bidirectional nature of the SIR multiplexer is a complexity factor
that offsets this advantage somewhat. The design shown in the figure
shows only the data communications switch; an identical circuit is
necessary to handle the clock signals. The SIR concept also enables a
greater flexibility in the decision process. The basic hybrid redundancy
scheme imposes a set algorithm in the hardwired logic of the multiplexer.
The SIR multiplexer is simply a switching network controlled by a set of
flip flops forming a control register. The control register hardware is
independent of the algorithm that decides how it is set, so any algorithm
could be used. Figure 2-4 shows the rotary multiplexer circuit for the case
of a 6 node SIR architecture.

The rotary multiplexer is basically a 5 x 2 full duplex switch. The
controls that determine which 2 of the 5 external processors are connected
to the host interstage are loaded into the flip flops that interface with the
host computer (shown in Figure 2-4 as boxes). Any 2 path combination of
connections between the two rotary multiplexer interfaces is possible by
loading appropriate values into the host controlled flip flop register.
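
The selection space of a 5 x 2 switch can be enumerated with a short Python
sketch. It treats a selection as an unordered pair of external nodes and
ignores which interstage channel each selected node is routed to. The one
bit per node control word is a hypothetical encoding chosen for
illustration; the actual flip flop layout is the one shown in Figure 2-4.

```python
from itertools import combinations

EXTERNAL_NODES = 5  # external processors seen by one host in a 6 node system


def valid_selections(n: int = EXTERNAL_NODES):
    """All ways of connecting the host interstage to 2 of the n external nodes."""
    return list(combinations(range(n), 2))


def control_word(selection, n: int = EXTERNAL_NODES) -> int:
    """Hypothetical encoding: one bit per external node, set when selected."""
    word = 0
    for node in selection:
        if not 0 <= node < n:
            raise ValueError("node index out of range")
        word |= 1 << node
    return word
```

Under this assumption there are ten possible two node selections, and every
valid control word has exactly two bits set. Because the register is simply
loaded by the host, any selection algorithm can be layered on top of it.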

Due to the hardware scheme described above, the SIR nodes do not
require lock step synchronization in order for correct operation to take
place. Not only can each node have a unique clock associated with it,
but the hardware also supports the design constraints imposed by N-version
programming. The system can be said to be loosely coupled, with the degree
of coupling being determined by the window size that the WDT imposes on
the internode communication within the SIR system (which is variable by an
instruction supplied by the host). In practice, the coupling will be
comparatively tight due to the constraints of the application problem and
the method of detecting faults.

Now that the design of the SIR node has been developed, a systematic
reliability analysis of the node design must be performed. Section II A
describes the MIL-HDBK-217B reliability model. Section II B partitions the
circuitry into appropriate subdivisions that share similar reliability
characteristics. The subdivisions will form the system components that
will be used in the system reliability model. The MIL-HDBK-217B reliability
model is used to calculate the system component reliabilities.

A. COMPONENT RELIABILITY MODEL

A component fault model for circuit boards composed primarily of
integrated chips has been developed by the US Department of Defense. This
model is based on exhaustive testing, performed at Rome Air Force Base, NY,
on a variety of chips from diverse manufacturers. The conditions of the
tests were varied to account for expected environmental conditions in
which the circuit boards may operate, as well as the complexity of the
circuits that are implemented on the actual chips [Ref. 15].

This model, designated as MIL-HDBK-217B, was published in 1976 and
covers several integrated circuit technologies including TTL, MOS, and ECL.
The model predicts a printed circuit failure rate which is based on an
exponential fault probability distribution for monolithic bipolar and MOS
circuits and has the form shown below:

$$\lambda = \pi_L \, \pi_Q \, (C_1 \pi_T + C_2 \pi_E) \, \pi_P \quad \text{(failures/million hours)}$$


Experience during the testing process has shown that 90% or more of the
faults that occur in printed circuit boards are due to integrated chips. The
effects of the printed board itself and such components as resistors and
capacitors on board reliability can then be neglected in design studies and
are not included in the model. Because an exponential distribution of faults
is assumed, the failure rate for an entire printed circuit board is the sum of
the failure rates for the chip components that are used in the circuit (a
series combination of system components).

The terms in the equation for failure rate each quantify the effects of
distinct environmental factors. The $\pi_L$ term concerns the "learning curve"
that is expected with new fabrication processes. The value of the term is
set to 1 for established processes and 10 for new processes. $\pi_Q$ is a
function of the amount of screening the chip receives from the manufacturer
prior to its release. The model gives a range of values to this variable.

$\pi_T$ and $\pi_E$ quantify the impact of environmental factors on the failure
rate. The former is a function of temperature while the latter is a function
of the mechanical stress (vibration and G forces) that can be expected in
environments of interest to the military (flight being one of them).

$C_1$ and $C_2$ are factors that quantify the reliability effects of gate
complexity on a given chip (or the number of bits for memories). $\pi_P$ is a
function of the number of pins in the chip package.
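
The way these terms combine can be illustrated with a minimal Python sketch.
It evaluates the chip failure rate as the learning and quality factors times
the sum of a temperature weighted and an environment weighted complexity
term, times the pin factor, then sums chip rates for a board (the series
combination) and converts a rate to a reliability under the exponential
assumption. All parameter values in the usage note are hypothetical
placeholders, not handbook values, and the function names are illustrative.

```python
import math


def chip_failure_rate(pi_l, pi_q, c1, pi_t, c2, pi_e, pi_p):
    """Failure rate of one chip in failures per million hours."""
    return pi_l * pi_q * (c1 * pi_t + c2 * pi_e) * pi_p


def board_failure_rate(chip_rates):
    """Series combination: the board rate is the sum of its chip rates."""
    return sum(chip_rates)


def reliability(rate_per_million_hours, hours):
    """Probability of no failure in `hours` under an exponential distribution."""
    return math.exp(-rate_per_million_hours * 1e-6 * hours)
```

For instance, `chip_failure_rate(1, 2, 0.01, 1.5, 0.005, 4, 1.1)` evaluates
to about 0.077 failures per million hours for these made up inputs, and ten
such chips in series give a board rate of about 0.77. Passing that rate and
a mission time to `reliability` gives the probability that the board
survives the mission.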


B. SIR SYSTEM COMPONENT RELIABILITIES 

The determination of what portions of the overall SIR architecture are
classified as distinct components plays a central role in the algorithms that
will manage the redundancy of the computer architecture. The
classification scheme must follow a minimal set of rules if it is to be an
effective tool in the redundancy management design process as well as in the
development of an accurate reliability model.

The first rule is that the grouping of circuitry into components should
follow functional relations. Division of a circuit into components that are
below the functional level should be avoided. This is a logical approach
because, for the redundancy management system to function properly, it
must be able to recognize the occurrence of
a fault with a set of tests of a reasonable size and complexity. There are a
number of techniques that are available for diagnosing faults within
circuitry, but it must be remembered that the SIR hardware is designed to
operate in a real time control system. Tests, and decisions based on the
outcome of diagnostic tests, must conform to the limited duration control
cycle of the real time application. The tests are also performed by the
hardware itself, which increases the complexity of many of the diagnosis
techniques. The underlying goal of the architecture is to be ultra reliable;
this implies that the hardware and software must be as simple as possible
while providing the required functionality.

The second rule for classifying circuitry into components is that the
classification scheme should not create components that cannot be
effectively managed by the redundancy management system. The lack of
this second rule would needlessly complicate the reliability model and
perhaps lead to inaccuracies. The redundancy management algorithms would
also become more complex for no purpose.

The redundancy management algorithms perform 3 main tasks. First, the
redundancy management algorithms perform a set of tests on the system in
order to identify any fault in the system of components. A test set must be
constructed so that faults in any of the components can be detected. Once
the fault is discovered, a fault location process is used that is composed of
another set of tests. Once the occurrence of a fault is detected and located,
the redundancy management routines must decide on a configuration for the
remaining good components (that satisfies the system requirement) in such
a way that the selected components interface with the overall system input
and output ports and the faulty component is prevented from affecting system
operation.
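
The three tasks (detect, locate, reconfigure) can be sketched in Python for a
single triad of active nodes. This is an illustrative simplification and not
the operating protocol developed in this thesis: it assumes that each active
node contributes one numeric value per cycle, that a fault shows up as
disagreement with two mutually agreeing nodes, and that spares are taken in
list order. All names and the tolerance are hypothetical.

```python
def locate_fault(values, tolerance):
    """Detect and locate: return the index of the one active node whose value
    disagrees with two mutually agreeing nodes, or None if no such node."""
    def agree(x, y):
        return abs(x - y) <= tolerance

    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        if (agree(values[j], values[k])
                and not agree(values[i], values[j])
                and not agree(values[i], values[k])):
            return i
    return None


def reconfigure(active, spares, faulty_index):
    """Replace the faulty active node with the first spare, if one remains;
    otherwise continue with the two remaining active nodes."""
    active = list(active)
    spares = list(spares)
    if faulty_index is None:
        return active, spares
    if spares:
        active[faulty_index] = spares.pop(0)
    else:
        del active[faulty_index]
    return active, spares
```

As a usage example, `locate_fault([1.0, 1.0, 9.0], 0.1)` points at the third
node, and `reconfigure(["n0", "n1", "n2"], ["n3"], 2)` then yields the active
set `["n0", "n1", "n3"]` with no spares left.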

The design of the SIR node must be analyzed keeping the rules outlined
above in mind. Recalling the functional block diagrams shown in Figures 2-2
through 2-4, there are several sections of a node that can fail. These
functional failure modes are listed in Table 2-1. This grouping of circuits
within the overall design is selected so that obvious functionalities will
remain within a single component.

There are a variety of ways in which the host processor can fail. The
design of the host processor is presumed to be of a generic form for the
NS32016 microprocessor. This component category includes the
microprocessor, its associated math coprocessor, the mass memory unit and
associated memory chips, and the glue chips necessary for
binding the components into a system.

This is quite a large component category. The justification for grouping
so large a set of subfunctions into one category is that the SIR architecture
proposes no reliability enhancement using redundancy within this
component. Some management of failures within this grouping of
components (such as a failed memory chip) is possible without using
component redundancy, but these management techniques are based on software
detection and correction algorithms. Of course, the correct execution of
software is dependent on some portion of nonfaulty hardware, so the level
of confidence of these fault management techniques is questionable.

TABLE 2-1
SIR NODE FAILURE MODES


A. Processor Failure
B. Interstage Failure
   1. Voter failure
      a. False three way equality indication (1 case)
      b. False two way equality indication (6 cases)
      c. False three way inequality indication (1 case)
   2. Timer failure
   3. Controller failure
   4. Slave Status Register Failure
C. Internode Communication Failure
   1. All links fail (Rotary Mux and/or Interstage (B’ & C’) failure)
   2. Selected links fail (1 - 4)
   3. Single Interstage channel fails (B’ or C’)


The groupings of components listed in the table within the
voter/interstage section of the node design are fairly obvious. There are of
course a large number of ways in which single gate level faults can occur
within any of these component categories. The result of any of these faults
is, however, the same; the component can no longer satisfy the functional
requirements for which it is designed.

There is again no redundancy within the voter and interstage sections of
the node design. Failure of any one of the components in these sections
prevents correct detection of vote errors or the passing of that detection
information to the connected host processor component. Recall that each of
the nodes is independent and bases its decisions about fault detection,
location, and recovery on the agreement of at least two of the three
connected processors. It is evident that any of the voter or interstage
failures within the host processor would make this detection process either
impossible or suspect. For this reason, the voter and interstage can be
classified with the processor as a single component.

The rotary multiplexer consists of a set of 2 paths (full duplex) that
connect the host node (processor and voter/interstage) to the external SIR
nodes. There are redundant paths inherent in the multiplexer design when
spares are included in the basic SIR starting configuration. (Without these
spares, there is no need of a rotary multiplexer.)

The purpose of advanced hybrid redundancy is to allow the replacement
of faulty components with good spare components. This process is
physically achieved by managing the redundancy in the rotary multiplexer.
The control word that resides in the flip flop set in the multiplexer is
changed so that a new path or set of two paths is selected that connects the
host with its external nodes. Failure of one of the paths not currently in use
is not detectable and does not degrade the confidence level of the decisions
currently being made within the host node, even though the ability of the
node to recover from a detectable fault has been reduced. (This assumes
that the failed circuitry does not affect the remaining circuitry by
overloading the power supply or injecting noise into the system. The failure
of a path will be assumed to be independent of the rest of the circuit,
although this assumption may in fact not be true for all cases. The impact
of any fault dependence should manifest itself in an increased rate of
failure for the remaining circuitry. Fault independence will be assumed in
the model developed in this paper.) The TMR confidence level requirement is
only that two connected nodes agree on state values. Therefore, on the
occurrence of a failure of one of the selected paths, a new path can be
established with another external node without an unacceptable degradation
in the confidence level. This of course presumes that the time interval
needed to establish the new path is small enough that the probability of
failure of the remaining good path through the multiplexer is vanishingly
small.
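
The size of this probability can be estimated with a short Python sketch
based on the exponential failure distribution assumed by the component
model. The failure rate and the interval used in the usage note are
hypothetical: a path rate of 5 failures per million hours and a
reconfiguration interval of one 20 ms control cycle.

```python
import math


def failure_probability(rate_per_million_hours: float, seconds: float) -> float:
    """Probability that a component with a constant failure rate fails
    within the given interval (exponential distribution)."""
    hours = seconds / 3600.0
    exposure = rate_per_million_hours * 1e-6 * hours
    # -expm1(-x) equals 1 - exp(-x) but keeps precision for very small x
    return -math.expm1(-exposure)
```

With these illustrative inputs, `failure_probability(5.0, 0.020)` is on the
order of 3 x 10^-11, which supports treating the failure of the remaining
good path during reconfiguration as negligible when the interval is short.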

The establishment of a good path is not solely dependent on one rotary
multiplexer path. The link between two SIR nodes is terminated in rotary
multiplexers at both ends of the link. The connection path between nodes
therefore consists of the paths through two sets of rotary multiplexers and
the physical connection between them. For the purposes of this thesis the
effects of the physical link between nodes, with respect to overall link
reliability, will be neglected (in line with the tenets of the
MIL-HDBK-217B model).

Analysis of the circuit used in the rotary multiplexer shows that a
distinct subset of component gates is used in each of the paths through the
rotary multiplexer. The subset of gates varies between the input and output
paths through the multiplexer, with the larger subset controlling the input
leg. The subsets for both the output and input legs of a link are shown in
Figures 2-5 and 2-6 respectively.

The larger set of gate components in the input leg of the path through the
multiplexer is due to the 5 input OR gate. The OR gate does not isolate the
output of the path (into the interstage) from stuck-at-1 faults that could
arise from the other input paths. This inability to isolate the effects of
faults in nonselected links causes a single path failure to propagate to one
of the interstage input registers (B’ or C’) and effectively causes a failure in
the interstage. The whole node is thus in a failed state and management of
the redundancy in the rotary multiplexer is not possible.


Study of Figure 2-6 reveals that the critical components needed in the
isolation of failed paths are the AND gates feeding into the 5 input OR gate.
In order for the isolation to take place, the outputs of the nonselected AND
gates must be a logical 0. The occurrence of a logic 0 at the output of the
AND gate is dependent on the correct operation of the gate as well as a
correct set of input values supplied to it.

The SIR architecture is based on a system of cold spares. That is, there
is a mechanism that controls the power being supplied to the nodes in the
system. The means of powering up a new node and depowering a node
determined to be a failed node is controlled by the combination of the
remaining two active nodes. A single node is unable to affect the power
controlling mechanism; therefore a single point of failure cannot disrupt the
power system. The circuit that performs this power control has not been
designed as yet, so it will not be included in the probability model that is
being generated in this thesis. The implications of the mechanism for the
SIR architecture, however, will be included.

The result is that an unpowered spare presents a logic 0 to its rotary
multiplexer outputs. This is fortunate in that the isolation of failed paths
now relies only on the correct functioning of the AND gates that feed the 5
input OR gate highlighted in Figure 2-6, as well as the 2 flip flops that
control the links to the remaining active nodes.

Effective management of the redundancy in the rotary multiplexer
requires that isolation of bad components be possible. Since there exists a
portion of the input path that cannot support this isolation, that portion
must be grouped with the rest of the node (the processor, voter, and
interstage) for both reliability calculation and redundancy management
purposes. Therefore, the 5 input OR gate, and the 5 AND gates that feed it,
will be considered a portion of the node. The AND gate that controls the
loading of the flip flops must also be classified as a portion of the node.
The 2 flip flops that control the inputs from the powered external nodes
will be classified with the link component.

A link component will consist of the output path as shown in Figure 2-5
and an additional 2 flip flops that contribute to the input path. Of course
the link is terminated on 2 ends, so the component list must be doubled.
Each link must also carry the clock signal for use in controlling the
interstage B’ and C’ registers, so the component count must be doubled again.
All remaining rotary multiplexer gates will be grouped together with the
host node for component reliability calculations.

Tables D6 through D10 of Reference 16 contain a breakout of integrated
chip failure rates calculated using pessimistic values for the parameters
contained in the MIL-HDBK-217B reliability model for printed circuit boards.
Tables 2-2 through 2-5 use the data in the referenced tables to calculate
the component and subcomponent failure rates of the SIR node. The gate
information for the listed chips was extracted from Reference 17. The
node's computer is assumed to consist of a NS32016 microprocessor and
NS32081 floating point coprocessor along with 64K of memory. The failure
rate information for the microprocessor and coprocessor was extracted
from References 18 and 19.

TABLE 2-2
INTERSTAGE CONTROLLER HARDWARE REQUIREMENTS


Gates/chip   Chips   Chip     Description

             2       LS       Quad D Flipflops
                     LS74     Dual D Flipflops
                     27S291   2K x 8 PROM
                     LS04     Hex Inverters
                     LS11     Quad 3-Input AND
                     LS30     8-Input NAND
                     LS20     Dual 4-Input NAND
                     LS02     Quad 2-Input NOR





TABLE 2-3
VOTER HARDWARE REQUIREMENTS

Gates/chip   Chips   Chip     Description

26           4       LS175    Quad D Flipflops
6            2       LS04     Hex Inverters
                     LS30     8-Input NAND
2            14      LS20     Dual 4-Input NAND
1            4       LS133    13-Input NAND

TABLE 2-4
LINK HARDWARE REQUIREMENTS


Gates/chip   Chips   Chip     Description

             8       LS74     Dual D Flipflops
2            4                2-Input AND
             8                Dual 2-Input OR

TABLE 2-5
FAILURE RATE BY SUBCOMPONENTS
(IN FAILURES / $10^6$ HOURS)


A. Node component list                 Individual failure rate

   Hardcore
      Voter                            515955
      Interstage controller            555665
      Shift Registers (20 LS299)       17.8429
      Buffers (4 LS245)                1.2469
      Slave Status Register (2 LS125)  429)7
      Watch Dog Timer (4 LS163)        2.1706
      Modulo 32 Counters (2 LS161)     0.9878
      Rotary Mux Gates                 1.3913

   Computer
      NS32016                          6.5041
      NS32081                          6.5041
      Memory                           318.4119

   Total                               594.2309

B. Link Total


III. System Environment


Before developing the redundancy management protocol and a system
reliability model, a more complete understanding of the control problem
being solved by the SIR architecture is necessary.

Recall that the design of the airframe is such that stability has been
reduced to a marginal value. The airplane design allows conditions (a
center of gravity aft of center of lift for pitch) that cause the airframe to
cross the line between stable and unstable operation. Catastrophic
instability is avoided by using real time avionic controls to correct the
flight pattern before the instability increases to a level that cannot be
corrected.

The environment in which the aircraft is flying is of course a major
factor in the rate of instability increase. NASA Ames has performed tests
on airframe stability under a variety of conditions. The analysis showed
that the X-29 aircraft, modified to the conditions described in Chapter I,
would display an oscillatory instability pattern with the amplitude doubling
every 100 ms for environments adverse to the designed airframe
characteristics. Breakup of the airframe may occur when the avionics
controls are not used to reduce the stress being applied to the aircraft by
that adverse environment.
A. DSPM AND THE CONTROL PROBLEM 


A 20 ms control cycle is considered by many to be the acceptable
frequency of applied controls [Refs. 1,2,20,21]. The cycle consists of
gathering flight state information from a set of sensors distributed about
the airframe, performing a computation on the sensory data to determine
the action necessary to bring the airframe within acceptable tolerances of
airframe stress, and distributing commands to a set of effector
servos that control the pattern of flight. A longer interval between receipt
of effector commands is possible while still being able to recover stable
flight operation, but an upper bound of 200 ms is predicted for the
maximum time that the F-8 can sustain a fault under critical flight
conditions [Ref. 21].

The solution must meet the reliability requirements of both NASA and
the FAA. The requirement of a 10^-9 per flight hour failure rate over a 10
hour mission time is quite stringent. It makes the problem significantly
more difficult and means the reliability requirement cannot be met by ad
hoc patchwork but requires a systems approach to reliability.

The avionics control cycle consists of gathering data (via sensors), 
performing calculations (via a computer), and exercising control operations 
(via effectors). A portion of the cycle that is not explicitly stated in this 
cycle is the communication of data to and from the computer (combining the 
sensors and effectors into one category). The system consists of three 
major components (as was shown in Figure 1-2). 

The ultra reliability of each of the major components of the system is
achieved through the use of redundancy. A basic description of hybrid
redundancy was given in Chapter II. An extension of this process can be
applied to the sensor/effector component resulting in three sources of
equivalent data that are made available to the computational element. A
voting scheme is used at the effectors to determine the correct signal in a
manner similar to the approach proposed by Siewiorek [Refs. 1,13]. A
midvalue vote process is not needed because further calculations are not
necessary at the effector. The simpler, and therefore more reliable, hybrid
redundancy hardware is sufficient.

Not only is the communication path a possible source of data corruption, 
for sensory data as well as effector commands, but the link may also be 
physically damaged. A simple approach for achieving the communications 
element of the overall control system is to provide a direct connection 
between the computational element and the sensor/effector nodes. This is 
not desirable because one component failure (the bus) could destroy the 
whole control system. 

A bus scheme such as the one shown in Figure 3-1 uses redundancy to
increase the reliability of the communications paths. While this approach is
the commonly accepted method, it has some drawbacks. The concept is
based on redundant components and not adaptability to possible system
states. A single failure on any of the terminals on a bus can cause the
entire bus to fail.

An example is the case of a babbling node. In this case the whole bus 
effectively fails because bus control is destroyed. It is even possible for a 
failure of a single remote terminal to render the entire redundant bus 
system useless. [Ref. 1] 

The dispersed sensor processor mesh (DSPM) is a system of
communications links that is designed to overcome the drawbacks of
redundant bus schemes while avoiding the hardware overkill that is implied
in a fully connected communication system. The system is discussed at
length by Dr. Abbott in Ref. 1, so a detailed description of the DSPM system
will not be given in this paper. An overview of the system is necessary,
however, because it impacts the reliability model that will be generated for
the SIR computer system.


[Figure labels: Remote Terminals; Bus Controllers]

Figure 3-1 Conventional Approach to Redundant Communications





A typical DSPM network is shown in Figure 3-2. The essence of the
reliability enhancement achieved by this system lies in two major system
characteristics. First, not all links in the system are in use. Active links,
displayed in the figure as solid lines, carry actual information. The links
displayed with dotted lines are inactive and carry no information. The
active links in the DSPM network form a set of tree structures that
originate at the bus controller and grow out to the furthest link in the
system of nodes. Each of the nodes in the system controls one or more
sensors or effectors, and the nodes are distributed throughout the airframe.

The second characteristic of the DSPM network is that network control
is centrally located in the bus controller. The bus controller manages the
network of links by three main algorithms: the growth, repair, and modify
algorithms.

The growth algorithm is a network initiation algorithm that determines
which links are used to form the trees shown in Figure 3-2. The growth
algorithm is a breadth first growth process. Note that each node of the
system has a number of links (full duplex). Each node is directed by the
bus controller to identify one of its ports as an inbound port through which
the node will receive bus controller (BC) commands; the node then relays
the BC commands through its other ports to the rest of the
tree. Each tree in the network has a different bus controller port as its
root. The direction of data flow and the state of each link in the network are
determined by the growth algorithm resident in the bus controller. A
necessary system property is that there are no closed loops in the network.
This is important to the algorithms that control the DSPM network. (See
Ref. 1 for details of why this is so.)
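
As an illustration only, the breadth first growth process can be sketched
in Python. The sketch assumes that the network is given as a list of node
pairs and that each tree is rooted at the node attached to a bus controller
port. The function name grow_trees, the node labels, and the data layout are
hypothetical and are not taken from Ref. 1. Because a node is attached only
on its first visit, every node has exactly one inbound link, so no closed
loops can form.

```python
from collections import deque


def grow_trees(links, roots, failed=frozenset()):
    """Breadth-first growth of loop-free trees (illustrative sketch).

    links  -- iterable of (node_a, node_b) pairs, each a full-duplex link
    roots  -- nodes attached to bus controller ports, one tree per root
    failed -- links known to be failed; these are never activated
    Returns (parent, active): parent maps each reached node to its inbound
    neighbour (None for a root); active is the set of links carrying data.
    """
    neighbours = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue  # failed links take no part in the growth
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    parent = {root: None for root in roots}
    active = set()
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in parent:  # first visit fixes the inbound port
                parent[nxt] = node
                active.add(frozenset((node, nxt)))
                queue.append(nxt)
    return parent, active


# Hypothetical five-node mesh with two bus controller ports (A and D).
links = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E"), ("E", "A")]
parent, active = grow_trees(links, roots=["A", "D"])
```

Links that are present in the mesh but absent from the returned active set
correspond to the inactive (dotted) links of Figure 3-2. Passing a set of
failed links regrows the trees around them.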

When a fault is detected by the system, the repair algorithm
circumvents the failed link or node by activating an inactive link to
reconfigure around the fault. There can be a very large number of system
configurations that generate an acceptable network structure given the
occurrence of a fault in the system. If the repair algorithm encounters a
second fault during the repair process, the links in the nodes are reset and
the growth algorithm is used again. This retreat to the growth algorithm
greatly decreases the complexity and processing requirements of the repair
algorithm and is acceptable so long as the growth algorithm is of sufficient
speed that the control cycle can be completed within the bounds of safe
operation. (The growth algorithm must also have the means to accommodate
failed link/node states in the growth process, which it does.)


Figure 3-2 Dispersed Sensor 
Processing Mesh Approach 





The DSPM modify algorithm is a method for discovering failures in
inactive links before they can become a critical factor in a repair or growth
process. Faults that occur in inactive links are not observable. The modify
algorithm makes these latent faults observable by periodically exchanging
the inactive links with their active counterparts while retaining the
requisite connections between the affected node sensor/effector
components. The algorithm is designed to be distributed over many control
cycles. The process is continued at the end of a number of control cycles
until all of the inactive links have been activated for some portion of the
current modify cycle. At the end of the modify cycle the process is begun
again. Detection of a "discovered" link fault will of course generate a repair
task.

The DSPM relies on the bus controller to make all network configuration
decisions. Network configuration management requires a significant
computational element and as such may be combined with the
computational element required for the overall control system. Of course
this requires that the SIR hardware and operating system software support
the additional task of controlling a complex communication network. The
entire task cycle in the SIR computer must run within a 20 ms control cycle
under fault free conditions. The SIR system must also be able to correctly
respond to both internal and external faults within the upper bound of
the control cycle.


B. SYSTEM TASK STRUCTURE 

The operating system that is to manage the SIR task cycle can be
implemented as an event driven real time operating system. The set of
tasks that must be scheduled is shown in Figure 3-3. These tasks are
scheduled as events by the operating system using a system of priorities to
determine the next event that is to be executed. There are other tasks that
the operating system must also schedule such as memory management and
I/O port control functions. These are not included in the set of tasks shown
in Figure 3-3 because they are standard tasks in general purpose operating
systems and have no impact on the reliability model that will be generated
for the SIR computer system.

Figure 3-3 Operating System Task Structure





A base line scheduling structure is used to manage the tasks when the
system is in an error free state. Because the SIR system is designed with
high reliability as a goal, it is expected that the majority of the control
cycles will fall into this task execution pattern. Figure 3-4 graphically
displays the error free task execution flow as a continuous bar of tasks that
conforms to the control problem to be executed by the SIR computer system.

Figure 3-4 System Task Flow


There are of course errors that can be encountered due to hardware and
software faults as discussed in Chapter II. An error handling module is
shown that controls the response to encountered errors. (The algorithms for
error detection and error state maintenance will be discussed in Chapter V.)

Upon encountering an error in the execution of any of the main loop tasks, 
an error handling task is generated. The error handler is of a higher priority 
than the main execution loop so control is passed to this module after 
saving the necessary register set and temporary variables in the currently 
executing module. 

The error handler locates the error and generates a task or set of tasks 
to respond to the error appropriately. The priorities of the error correction 
tasks are adjusted for correct execution order. The error handler sets these 
priorities based on the severity of the error. The execution control may in 
fact be returned to the main loop routine that detected the error, with the 
error correcting tasks scheduled for execution after completion of the 
control cycle. In this case, the Modify task would not be implemented in the 
current control cycle, but delayed until completion of the next control cycle. 
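
The priority mechanism described above can be illustrated with a small
Python sketch. It is not the SIR operating system: the class name
EventScheduler, the task names, and the three priority levels are all
hypothetical, and saving of registers is not modeled. The sketch only shows
how an error handler of higher priority runs before the remaining main loop
tasks, and how a low severity correction can be deferred until the control
cycle has completed.

```python
import heapq
import itertools


class EventScheduler:
    """Minimal priority-driven event queue; lower number = higher priority."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order

    def schedule(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def run(self, on_task):
        """Execute tasks in priority order; on_task may schedule new tasks."""
        executed = []
        while self._heap:
            _, _, name = heapq.heappop(self._heap)
            executed.append(name)
            on_task(self, name)
        return executed


# Hypothetical priority levels: error handling pre-empts the main loop,
# deferred corrections run after the control cycle tasks.
ERROR, MAIN, DEFERRED = 0, 10, 20


def on_task(sched, name):
    if name == "compute":            # pretend an error is detected here
        sched.schedule(ERROR, "error_handler")
    elif name == "error_handler":    # low severity: correct after the cycle
        sched.schedule(DEFERRED, "error_correction")


sched = EventScheduler()
for task in ("gather", "compute", "distribute"):
    sched.schedule(MAIN, task)
order = sched.run(on_task)
```

A more severe error would simply be given a correction priority above MAIN,
so that the correction runs before the control cycle resumes.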

As stated in Chapter II, the operating systems of the active SIR
computer nodes are completely independent. All of the software written for
the SIR computer must also be independent, or protocols must be designed to
distribute system information correctly among the processors. Each of the
SIR processor nodes contains one of the BC ports into the DSPM network (as
described in Chapter II). Because the DSPM management algorithms depend
on the absence of any closed loops in the subtree structures (within a tree
or among the bus controller and any combination of trees), there is definite
dependence among the software that resides in each of the SIR processor
nodes.

The system information needed to manage the DSPM network consists of
a set of tables that describe the DSPM network state (which links are
active, inactive, or failed) and information on the tree structure that forms
the communications paths from the bus controller to the sensor/effector
nodes of the DSPM network. Each of the active SIR processors has
independent control over one of the trees, but state information must be
global among the processors if the DSPM algorithms are to function
properly.


It is fairly obvious that there must be a high degree of confidence in the
values of the state information that is passed between the SIR processor
nodes. The method by which this information is passed must provide a
degree of verification, or any reliability models that describe the SIR
system will be incomplete and inaccurate.

The programs that will implement the DSPM algorithms will be coded as
if the SIR system is a single processor computer. A mechanism that can be
used to control the actual three processor environment is a set of traps
embedded in the operating system. These traps treat the section of memory
that contains the DSPM state tables as a special memory category. When
an update operation is performed on data within this section of memory, the
trap routine communicates the update to the other active processors in the
SIR processor network. System state table congruency between the
processors is assured by this trap system.
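
The trap idea can be sketched in Python as a table whose write operation
is intercepted. This is only an analogy under stated assumptions: the class
name ReplicatedStateTable and the link labels are hypothetical, the three
"processors" are three objects in one program, and no verification of the
transferred values is modeled. The application code writes to its own table
as if it were alone, and the intercepted write forwards the update to the
peers.

```python
class ReplicatedStateTable(dict):
    """State table whose item writes are forwarded to peer copies.

    Only direct item assignment (table[key] = value) is trapped; other
    dict methods such as update() are not intercepted in this sketch.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.peers = []  # tables held by the other active processors

    def __setitem__(self, key, value):
        super().__setitem__(key, value)           # local update
        for peer in self.peers:
            # Write straight into the peer's storage so that the peer's
            # own trap does not echo the update back.
            dict.__setitem__(peer, key, value)


# Three tables standing in for the three active SIR processors.
a, b, c = ReplicatedStateTable(), ReplicatedStateTable(), ReplicatedStateTable()
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]

a["link_7"] = "failed"   # update made on processor A
b["link_3"] = "active"   # update made on processor B
```

After these two writes all three tables hold the same entries, which is the
congruency property the trap system is meant to provide.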

IV. SYSTEM RELIABILITY


The model for the complete sensor/DSPM/SIR system is quite complex
and is best approached as a series of models, one for each major subsystem.
The reliability of the DSPM system has been estimated as 10^-12 system
failures per flight hour during a 10 hour flight [Ref. 1]. Achieving a similar
reliability figure is a goal of the SIR architecture.

There are several levels that can be viewed in developing a reliability
model for the SIR system. The major levels are the component reliability
model and a systems reliability model for calculating the effect of the
components taken as a system. The reliability model for the components in
the system has been developed in Chapter II and a set of component
reliabilities has been generated from this component level model.

The SIR processor network is based on hybrid redundancy, which itself is
based on a TMR operating environment. The purpose of the TMR model and its
supporting architecture is to tolerate a number of faults while still
providing results that are judged to be correct within a defined range of
confidence. The TMR model specifically judges that a minimum acceptable
confidence level can only be obtained when at least two communicating
components (processors) agree on the result of a test applied to system
data values. The confidence level of correct system operation increases as
more components can be used in the verification process. Ideally, the
configuration control software will modify the system configuration to one
that provides the optimum confidence in system output given the occurrence
of one or more specific faults. Figure 4-1 graphically shows the basic TMR
system configuration from a logical component level. The components of
the system that are subject to faults (and fault management) are the
individual processors and the system links connecting the processors.




[Figure legend: P = processor node]

Figure 4-1 Basic TMR System

The model also requires a specification of the set of operating procedures
that are necessary to manage all of the possible "good" states. These
operating procedures must include a test set for each good system state
such that [...] Each good system state can require a unique operational
protocol and test set since the combination of good components in the
system will vary with fault occurrences.


Effort has been expended to determine how the reliability of the system
increases as spares are added to the basic TMR system [Refs. 5,22,23]. In
order to achieve greater system reliability the spares are assumed to be in
an unpowered state. A model for predicting the reliability of unpowered
spares is very difficult to develop. The difficulty arises from being unable
to test the component in an unpowered state. The process of powering up
the component most likely introduces more stress on the component than
the entire time the component is in the unpowered state. As a result,
periodically activating a major component like a computer node for fault
testing is probably not reasonable. This thesis will not concern itself with
the subtleties of modeling the unpowered state. An assumption will be
made that the failure rates of the unpowered components are a
single order of magnitude less than those of the components in the powered
state.

The goal of the system as defined above is twofold. First, and most
important, the goal is the operation of the system in the presence of a
number of faults. A second goal is operation in the configuration giving the
most confidence in the results being generated by the system.

What are the implications when spares are introduced to the basic TMR
system? The standard system operation remains the same: three active
processors compare outputs to determine correct operation. The difference
in reliability is that the number of system links grows in a nonlinear
fashion as spares are added to the system (n(n-1)/2, where n is the number
of nodes in the system).

Nonlinear growth in the number of system links implies that the growth
of system states is also nonlinear. Each of the system components can be
modeled as being in one of two states: functional or nonfunctional. This is
a simplification since there is also the possibility of improper component
functioning, but that will be ignored for the present. The number of states
possible, given the starting configurations, is then 2^n, where n is the
number of system components. An indication of the rate of growth in system
states is evident in Table 4-1. A systematic method is needed to identify
the set of states necessary for a valid reliability model and to determine
the necessary operating protocols and test sets.





TABLE 4-1
STATE PROGRESSION
AS SPARES ARE ADDED TO BASIC TMR

Configuration      Processors   Links   States

Basic TMR          3            3       2^6  = 64
TMR + 1 Spare      4            6       2^10 = 1024
TMR + 2 Spares     5            10      2^15 = 32768
TMR + 3 Spares     6            15      2^21

System operation always occurs with an active set of processors of two
or three. This property of system operation allows the development of an
incremental system reliability model for the SIR processor network. The
basic model covers the three processor (basic TMR) case. Each of the
incremented models, corresponding to the addition of spares to the system,
will degenerate to the basic TMR case after some combination of component
faults in the overall system.
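
The link and state counts behind Table 4-1 follow directly from the two
expressions given earlier: n(n-1)/2 links for n fully connected processors,
and 2 raised to the number of two-state components. A short Python check,
with function names chosen here only for illustration, reproduces the
progression.

```python
def link_count(nodes):
    """Links in a fully connected network of `nodes` processors: n(n-1)/2."""
    return nodes * (nodes - 1) // 2


def state_count(nodes):
    """Each processor and each link is functional or not: 2**(components)."""
    return 2 ** (nodes + link_count(nodes))


for spares in range(4):
    n = 3 + spares
    components = n + link_count(n)
    print(f"TMR + {spares} spare(s): {n} processors, {link_count(n)} links, "
          f"2^{components} = {state_count(n)} states")
```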

Since there is no single control point in the system being modeled, the
total number of possible system states is not needed either to calculate the
number of distinct operating protocols required or to calculate the
reliability of the system. Clearly, some aggregation of states into like
configurations is possible (differing only by label changes of the nodes and
appropriate connecting links).


A. THE SEMI-MARKOV MODEL 

The Markov process model is a powerful tool for analyzing complex
probabilistic systems. The central concepts of such models are states and
state transitions. The states of a system have already been defined, but it
should be pointed out that each state of the system represents all that must
be known to describe the system at any instant. A second key concept in
this model is the state transition. As time passes and faults are
introduced, the system passes from state to state. These changes of state
are called state transitions. Discrete-time models require all of the
transitions to occur at fixed intervals and assign probabilities to each
possible transition. For reliability models, the transitions represent failure
occurrences and configuration functions (or repair functions for other than
real time applications) [Ref. 16].

The basic assumption of Markov models is that the probability of a given
state transition depends only on the current state. The length of time spent
in a state does not influence the probability distribution of the next state or
the distribution of time remaining in the present state. This assumption is
rather strong but it fits naturally with the assumption that failure rates
are constant. The constant failure rate assumption applies to the
operational phase of component operation and results in an exponential
distribution of arrival times of failures. The Weibull distribution, based on
non-constant failure rates, applies to the burn in and wear out phases of
component operation. The model developed in this paper will apply to the
operational phase.
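
The link between a constant failure rate and the Markov assumption can be
made explicit. As a brief worked step (standard probability, not taken from
the references), let T be the time to failure of a component with constant
failure rate \lambda:

```latex
\begin{align*}
R(t) &= P(T > t) = e^{-\lambda t}, \\
P(T > s + t \mid T > s) &= \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}}
                         = e^{-\lambda t} = P(T > t).
\end{align*}
```

The time already spent in a state therefore carries no information about
the time remaining in it, which is the property the Markov model requires.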

The reliability model that will be developed for the SIR computer system
does not require the entire set of system states. There are several cases
where the system can catastrophically fail before the system states are
exhausted by the arrival of faults. These failure conditions arise due to the
confidence requirements of TMR systems. For example, failure of two of the
active processor nodes without a correcting reconfiguration is a
catastrophic condition, even if there are several spares in the system that
are in a non-failed state. System failure states are referred to as death
states; no transitions from death states are possible.

The probability of entering a death state is precisely what is needed to
determine the reliability of the system. The calculation of the probability
of entering the death state of a semi-Markov model requires the solution of
a set of coupled differential equations. The large disparity between rates
of fault arrivals and the rate of recovery (based on reconfiguration) usually
leads to numerically stiff differential equations. This problem, along with
the high computational cost of solving large state space problems, has led to
the use of tools such as CARE III, HARP, and ARIES [Refs. 10,16].

A tool that was recently developed at NASA Langley is the Semi-Markov
Unreliability Range Evaluator (SURE) [Ref. 10]. The program is based on a
mathematical theorem developed by White [Ref. 11] that enables efficient
computation of the death state probabilities. The technique provides a
means of bounding the probability of entering a death state of a
semi-Markov model using simple model parameters such as the means and
variances of the state transitions. The advantage of the SURE technique is
that the bounds are algebraic in form and thus computationally efficient.
Because of this computational simplicity, very large models can be analyzed
by the program.

Models of highly reliable fault tolerant systems generally exhibit
both slow and fast processes with respect to mission time. When these
systems are modeled stochastically, some state transitions are many orders
of magnitude faster than others. The slower transitions correspond to the
arrival of faults in the system while the faster transitions represent the
system response to the fault. Fault arrivals are modeled as exponentially
distributed and are therefore time invariant with respect to the length of
time that the system resides in a specific state. System recovery
transitions are generally not exponentially distributed and therefore the
rates are time dependent. In order to preserve the semi-Markov nature of
the system, the time since entering the current state is used to calculate
the system recovery time probability. Because the TMR system uses three
way voting to mask a fault, there is a race between system recovery and the
occurrence of another fault.
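
The size of this race can be illustrated with a rough Python calculation.
The sketch makes two simplifying assumptions that are not part of the SURE
method: the competing fault arrivals are exponential with a single combined
rate, and recovery takes a fixed, known time. The function name and all
numerical values are hypothetical.

```python
import math


def prob_fault_before_recovery(competing_rate_per_hour, recovery_time_s):
    """P(another fault arrives before recovery completes).

    Assumes exponential fault arrivals at `competing_rate_per_hour` and a
    fixed (deterministic) recovery time; both are simplifying assumptions.
    """
    recovery_time_h = recovery_time_s / 3600.0
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-competing_rate_per_hour * recovery_time_h)


# Hypothetical numbers: two surviving processors at 5e-4 faults/hour each,
# with recovery completed within one 20 ms control cycle.
p = prob_fault_before_recovery(2 * 5e-4, 0.020)
```

Because the recovery time is many orders of magnitude shorter than the mean
time between faults, the result is approximately the product of the rate
and the recovery time.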

The state and state transition basis of the semi-Markov model is
represented very nicely by a directed graph. An example of a graph
representation of a Markov process is shown in Figure 4-2. The states are
represented by the labeled nodes in the figure while the state transitions
are represented by the directed arcs.


Figure 4-2 Graphical Representation
of a Semi-Markov Model





The horizontal arcs correspond to the slow transitions of fault arrivals.
These occur with exponential rate $\lambda$, with the coefficients of
$\lambda$ representing the number of components that can fail. Vertical arcs
represent fast transitions that correspond to system recovery.

White's theorem is based on a graphical analysis of a semi-Markov model. 
The theorem calculates the bounds on the probability of traversing a 
specific path within a specific time. Applying the theorem to all of the 
possible paths of the model results in determination of the probability of 
the system reaching any death states bounded by a narrow interval. [Ref. 24] 

The SURE program will be used as a tool for determining the reliability
of the SIR system model. The model developed in this thesis will conform
to the graphical representation conventions that are described for Figure
4-2. See References 10, 11, and 24 for a more detailed discussion of White's
theorem.


B. THE SINGLE FAULT ARRIVAL ASSUMPTION 

The SIR system of processors performs a self diagnosis to detect and
locate faults within the active processor set. The reliability model of such
a system can assume a variety of fault arrival conditions. The level of
confidence in fault detection and correction in a discrete system is
dependent on the number of faults that can arrive during one interval. The
confidence level will decrease as the number of simultaneous component
faults increases.

The reasoning behind this property is straightforward. As the number of
simultaneous faults grows, the detection of
a fault in the system becomes increasingly more difficult and the
algorithms to perform this detection become more complex. An example
illustrates the point nicely. Suppose that a host processor (Processor A) in
the basic TMR system receives a vote after a data exchange which indicates
an unacceptable inequality condition on one of the external active SIR
processor nodes (Processor B). An acceptable conclusion could be that the
link connecting nodes A and B was corrupted by noise, or that processor node
B is faulty. But suppose that there is also the possibility that Processor
A's voter could have malfunctioned. This poses a problem to the operating
protocol in Processor A. If the processor assumes that the fault was due to
the voter, it will take itself out of service. If there is the possibility that
the voter is malfunctioning, then any tests received by Processor A by way
of its voter are also suspect. If the fault actually occurred in Processor B,
then two of the three processors are effectively in a faulty state, and the
system collapses.

The best case for system confidence is obviously for there to be only one
fault occurrence during any one test cycle. Reliability based on the arrival
of faults for single components is modeled as an exponentially distributed
probability function and has the form shown in equation 4-1.


    $R = e^{-\lambda t}$                                        (4-1)


where
    $R$ is the reliability
    $\lambda$ is the component failure rate
    $t$ is the time interval since the last known
        good state was observed.


The reliability of the TMR system is a multiplicative combination of the
reliabilities of the TMR components. An equation for reliability for a fully
operational TMR system is given in equation 4-2,


    $R_S = (R_P)^3 \times (R_L)^3$                              (4-2)

where
    $R_S$ is the system reliability,
    $R_P$ is the reliability of a single processor node,
    $R_L$ is the reliability of a single processor link.


Because the processors and links used in the TMR system are identical,
there is no reason for a unique labeling system, hence the terms in equation
4-2. The equation is a statement of the probability of there being exactly
zero component faults in the TMR system during a specified interval. This
is one of a set of probabilities that cover all of the possible component
fault states that the TMR system can be in during that interval. The
occurrence of multiple faults within a single component is not germane to
the system reliability equation because a single fault is assumed to cause
the incorrect operation of the component.

Suppose that a hypothesis is made that states that there can only be a
single component fault during the test interval. The reliability figure
desired for the system is 10^-12 system failures per hour during a 10 hour
flight. The probability of more than one component fault occurring during a
test cycle should be at least an order of magnitude smaller than the desired
system failure probability over the mission time. If it can be proven that
this is the case, then the assumption that only one component fault can occur
within one test cycle is valid. The operating protocols, given a single fault
arrival assumption, will then be a great deal less complex, as will the
reliability model describing the SIR system.

The probability that there will be exactly one faulty component within
the test cycle is the summation of the probabilities that a unique component
will fail and the remaining components will not fail. In a system consisting
of six components this requires six probability terms. An equation for the
case of exactly one faulty component during a test interval is shown in
equation 4-3.


    $P_S(1) = 3(1 - R_P) R_P^2 R_L^3 + 3(1 - R_L) R_P^3 R_L^2$  (4-3)


where
    $P_S(i)$ is the probability that i components will fail in time t
    $(1 - R_k)$ is the probability that component k has 1 or more faults

The probability that two components will fail in the test interval is
calculated in a similar manner. The number of terms in the equation will
be equal to the number of unique cases of components taken two at a time
from the set of system components. Equation 4-4 shows this relation.


    $P_S(2) = 3(1 - R_P)^2 R_P R_L^3 + 3(1 - R_L)^2 R_P^3 R_L$

    $\qquad + 9(1 - R_P)(1 - R_L) R_P^2 R_L^2$                  (4-4)


Substituting the values for component reliabilities developed in Chapter
II into the equations for probability of faulty components during a test
interval leads to the probabilities shown in Table 4-2. A test for faults is
assumed to take place during every control cycle as depicted in Figure 3-4.
The test interval used in the fault probability equations is the maximum
time interval allowable for a control cycle for the F-8 digital fly-by-wire
system.

TABLE 4-2
PROBABILITY FOR MULTIPLE SIR COMPONENT FAILURES
WITHIN MAXIMUM BOUNDS OF A CONTROL CYCLE


Component failures within t        Probability
0 8.97000 x 10° 
| 8.96999 x 10° 
2 2.69985 x 10°” 





It is obvious that the probability of more than one component becoming
faulty during a single test interval is insignificant when compared to the
overall reliability requirement for the system. The protocols and system
reliability models that are developed in this paper assume that the fault
arrivals are singular during a test interval.

V. THREE PROCESSOR CASE 


This chapter will discuss the three processor case of the SIR
architecture. Expansions to the three processor case are possible using
additional computer nodes as spares; however, all of these cases will
effectively degenerate to some form of the three processor case by the
introduction of enough system component failures.

The operation of the SIR system is based on the TMR principle. System
spares are in a cold, nonpowered state and can only be activated by actions
taken by the other two active SIR processors in concert. These two system
characteristics cause the three processor case to be important to larger
systems with spares, even before the introduction of component failures.
The SIR architecture uses an active processor set consisting of three
processors; all the tests for fault detection and location and the decisions
based on the outcomes of these tests are made by the active three processor
set. The internode communication links do not require a rotary multiplexer
for the three processor case. The multiplexer hardware is, however, needed
for all cases that contain spares, so the links (using the standard link
hardware discussed in Chapter II) will be included. The results will apply
to all cases regardless of the number of spares.

There are several issues that apply to the three processor case of the
SIR architecture. Each of these issues will be discussed in a separate
section. Section A will discuss the operation of data congruency in the
three (good) processor case. Section B will develop a set of system states
that meets the TMR requirement for two connected, good processors. This
set of system states will consist of acceptable “views” of a partially failed
system that can still be managed. Section C will develop a high level
communications protocol that is necessary for the data congruency
operations discussed in section A. The protocol will also address the fault
location process. Finally, section D will develop a semi-Markov model of
the three processor case based on aggregations of acceptable system states
discussed in section B.


A. DATA CONGRUENCY AND FAULT DETECTION 

The SIR architecture was developed to meet a real time control problem
that has a specific cycle of events to process. The event cycle includes a
series of data words that must be exchanged reliably between the active
processors. These data words can be exchanged in two basic modes. The
first mode requires an exact match of the value that each of the three
active processors generates. This type of word will consist of system state
information as discussed in Chapter III (states of the DSPM system). A
separate case where bit invariant data exchange is necessary is in the
command words of the inter-SIR communications protocol discussed in
section C of this chapter. The test for equality of the triad of data words is
made at each of the active processors by their respective voter elements.
The equality condition is observable to the host processor by viewing the
contents of the slave status register.

A second category of data word communication is also required. The
sources of these words are the sensor inputs to the SIR system and the
output values generated by the flight law calculations in the separate active
processors (that are subsequently communicated by the DSPM to the
effector servos in the aircraft). There is a large probability that data of
this type will vary slightly in the low order bits as discussed in Chapter II.
There are bounds on the amount of variation acceptable in the data, however.
Observability of the size of the variation between the minimum and
maximum values of the data triad is achieved by calculation in the node's
host processor. The preset bounds, with respect to the particular word
being voted, are applied to this calculated difference to determine whether
any of the values deviate unacceptably.

Both of these data transfer types will be used in each of the control 
cycles that were discussed in Chapter III. The cross communication and 
vote of these data words provide a comprehensive test of the active
components in the SIR system. The process of communication itself is a 
test of the links that connect the active processors; each active processor's 
internal components are thoroughly exercised during the execution of the 
flight law modules. 

Although exercising the functionality of a component is not in itself a
test for a fault, the verification process inherent in the voting process is
indeed a test for a component fault. So long as two of the three processors
agree on the outcome of a particular vote process, then the agreed upon
value can be taken as valid with the confidence level of the overall system
remaining at an acceptable level. (Of course a three way agreement
generates a higher level of confidence.)
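
The two-of-three agreement rule for invariant (exact match) words can be illustrated with a minimal Python sketch. It models only the decision logic; the function name `vote_exact` is hypothetical, and the SIR voter hardware and slave status register are not modeled.

```python
from collections import Counter

def vote_exact(a, b, c):
    """Vote three bit-invariant data words.

    Returns (value, agreeing): value is the word held by at least two of
    the three sources, or None when no two words match; agreeing is the
    number of sources that hold the most common word.
    """
    value, agreeing = Counter((a, b, c)).most_common(1)[0]
    if agreeing >= 2:
        return value, agreeing
    return None, agreeing

# Three way agreement gives the highest confidence; two way agreement is
# still acceptable; no agreement means no value can be trusted.
print(vote_exact(0x3A, 0x3A, 0x3A))
print(vote_exact(0x3A, 0x7F, 0x3A))
print(vote_exact(0x01, 0x02, 0x03))
```
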




When an unacceptable comparison of the active processor generated data 
values occurs, of either the variant or invariant type, an indication of a
component fault condition is established. The fault can be caused by quite a 
number of faults internal to the offending node, for example the node's 
voter, floating point processor, or memory. The fault could also be caused 
by the link connecting a pair of the active processors. 

The fault that has been detected by the unacceptable data comparison can 
be caused by either a permanent fault or a transient fault. Transient faults 
have many causes such as excessive electrical noise in the aircraft 
environment and marginal operation of some set of component circuits. The 
effects of transient faults have a low probability of persisting over an
extended time, so repeating the tests over a number of control cycles can
eliminate the possibility (with high probability) that the error is a
transient one. [Refs. 1, 13]
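
One way to picture the repeated test is a persistence counter over successive control cycles. The Python sketch below is an illustration under assumptions: the threshold of three consecutive disagreements and the function name `classify_fault` are hypothetical, since no specific count is given here.

```python
def classify_fault(disagreements, persistence_threshold=3):
    """Classify a disagreement history, one boolean per control cycle.

    True marks a cycle whose cross link and vote failed. A run of at
    least persistence_threshold consecutive failures is treated as a
    permanent fault; any shorter run is treated as transient.
    """
    consecutive = 0
    seen = False
    for mismatch in disagreements:
        if mismatch:
            seen = True
            consecutive += 1
            if consecutive >= persistence_threshold:
                return "permanent"
        else:
            consecutive = 0
    return "transient" if seen else "none"

print(classify_fault([False, True, False, False]))   # isolated glitch
print(classify_fault([False, True, True, True]))     # persistent error
```
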

There is an external system characteristic that can cause a fault
detectable by data disagreement. In the ideal case the data supplied by the
external sensors and transmitted over the DSPM system will arrive as a
data triad with one data word being supplied to each node. An unacceptable
variation in the values of these data words could be caused by faults in the
sensor set or the DSPM system. In either case, the SIR system would
generate a data disagreement detected fault. A preset test for correct
operation of the SIR system is required to isolate the fault within the SIR
system or within the external system when sensor data is being cross
linked and voted. If the fault is isolated to the external system (consisting
of the sensor set or the DSPM system) then this information is passed to
DSPM redundancy management routines. (These routines will not be
considered in this paper.)

1. Simplex Data Transfer 

Simplex data is propagated from one SIR node to the remaining two
active nodes by an algorithm designed to ensure congruency in the data
between processors. Of course the transfer of simplex data must be made in
an invariant manner because only one copy of the data is made available to
the SIR system. Figure 5-1 displays a graphical representation of the
simplex data distribution algorithm. Data supplied to node A (either by
external input or a DSPM state table update as a result of a DSPM algorithm)
is loaded into each of its interstage registers. In the second phase of the
transfer, the interstage registers (B and C shown in Figure 2-3) transfer
their contents to the associated interstages of the remaining two active
nodes. The value in the home register (A in Figure 2-3) remains the same.

In the third stage each of the interstages loads the received data into
its register set complement. At this point all the interstage registers
contain the original data value. In the fourth stage, the contents of the
interstages are again cross linked. At this point, each of the interstages
has a copy of the data received by each of the remaining interstages. The
interstages vote the data at this point. If the votes indicate a three way
equality (observable by there being no min or max indication in the slave
status register) then the data transfer is correct. A maximum or minimum
indication by the slave status register indicates an error condition in the
system that must be located.
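
The four stages can be traced with a small simulation. The Python sketch below is a simplified model, assuming each node ends up holding one copy of the word and that a faulty link can be represented by a function that alters words in transit. The names `simplex_distribute` and `corrupt` are hypothetical, and the register level detail of Figure 2-3 is not modeled.

```python
NODES = ("A", "B", "C")

def simplex_distribute(source, value, corrupt=None):
    """Distribute one simplex word and vote it at every node.

    corrupt(sender, receiver, word) optionally models a faulty link by
    returning an altered word. Returns {node: (triad, all_equal)}.
    """
    send = corrupt or (lambda sender, receiver, word: word)

    # Stages 1 and 2: the source keeps its home copy and transfers the
    # word to the other two active nodes.
    held = {source: value}
    for node in NODES:
        if node != source:
            held[node] = send(source, node, value)

    # Stage 3 is implicit: every node now fills its registers with the
    # copy it holds. Stage 4: the copies are cross linked again, so each
    # node sees its own copy plus the copies held by the other two.
    result = {}
    for node in NODES:
        triad = [held[node]]
        triad += [send(other, node, held[other]) for other in NODES if other != node]
        result[node] = (tuple(triad), len(set(triad)) == 1)
    return result

# Fault free transfer: every node reports a three way equality.
print(simplex_distribute("A", 0x55))

# A link from A to B that flips the low bit: the mismatch is observable.
flip_ab = lambda s, r, w: w ^ 1 if (s, r) == ("A", "B") else w
print(simplex_distribute("A", 0x55, corrupt=flip_ab))
```
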

2. Three Value Data Transfer

The procedure for exchanging and comparing the data generated by the
external sensors (assuming that one sensor value of a data triad is supplied
to each of the active node's external ports) follows a similar approach. As
shown in Figure 5-2, the exchange can be made in a more concurrent manner
in this case. The nodes are assumed to be in loose synchronization, which is
justified by the relatively short duration of the control cycle being executed
in the SIR system and the synchronization task that is the precursor to each
control cycle execution (as shown in Figure 3-4).


The data to be voted is either supplied to each of the active node's
interstages by SIR external sensors or generated internally by the
flight law calculation task. An additional source of data to be voted is the
command words of the inter-SIR communication protocol discussed in
section C. The data can be considered to be a triad of simplex data at the
beginning of the exchange process. Each of the data values to be
exchanged is independently generated.

The process of exchange begins with each of the nodes transferring
its simplex data value to registers A, B, and C of its respective
interstage. The process of loading a starting value in the interstages is
performed concurrently, in contrast with the procedure shown in Figure
5-1.

After one exchange cycle among the interstages, each interstage
contains the complete triad of simplex data values. A vote is then made on
the data triad by each of the interstages. The procedure after the vote
depends on the type of data transfer that is being made. If the data transfer
is of the invariant type, i.e., the values must exactly match, then the slave
status register contains all of the information necessary to test for three
way equality or, conversely, the indication of an error condition within the
active SIR component set. The value in register A after the vote can then be
transferred to the node's host processor and execution can continue.
(Register A will contain the mid value of the data triad as determined by the
voter.)

Figure 5-2 Voting Through a SIR Interstage (Three Valued)





If the data to be exchanged and voted upon is of the variant type then
the contents of the A, B, and C registers must be transferred to the node's
host processor. At the completion of the vote process register A contains
the mid value while registers B and C contain the maximum and minimum
values respectively. The host calculates the difference between the
minimum and maximum values and compares the difference with the
maximum bound for the data word that is being cross linked and voted. If
the difference exceeds the maximum bounds for the data word, then the
error condition indicates either a failed component within the active set of
SIR components or a failed component in the sensor/DSPM system. A preset
test must be performed if the data triad originated externally to the SIR
system. An out of bounds condition on internally generated data indicates a
SIR component failure. The cause of an internally generated failure can be
attributed to either hardware or to one of the N-version programs that carry
out the flight law calculations. Fault location algorithms must then be used
to locate the faulty component and remove it from the set of components
that interface with the systems external to the SIR architecture.

The midvalue that is contained in register A at the completion of the
vote will be taken as the correct value for further program execution. The
differences between the midvalue and the minimum and maximum values are
used to locate which of the two values caused the variation bound violation.
The value that is closest to the midvalue is the value that is assumed to be
fault free. The value associated with the larger difference is assumed to be
the value caused by a component (or software) fault. Again, the fault could
have been caused by the node, link, software, or an externally applied SIR
input.
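
The host processor's treatment of variant data can be summarized in a short Python sketch. It assumes the three values are already available to the host, labeled by source node, and that the variation bound is supplied per data word. The function name `vote_variant` is hypothetical, and a tie between the two differences is resolved arbitrarily toward the minimum value.

```python
def vote_variant(triad, bound):
    """Mid value vote with a variation bound test.

    triad maps a source node label to its value. Returns
    (mid_value, within_bounds, suspect): suspect is the node whose value
    lies farther from the mid value when the bound is violated, else None.
    """
    ordered = sorted(triad.items(), key=lambda item: item[1])
    (min_node, min_val), (_, mid_val), (max_node, max_val) = ordered

    if max_val - min_val <= bound:
        return mid_val, True, None

    # The extreme value closer to the mid value is assumed fault free;
    # the one with the larger difference is assumed to be the faulty one.
    suspect = max_node if (max_val - mid_val) > (mid_val - min_val) else min_node
    return mid_val, False, suspect

print(vote_variant({"A": 100, "B": 101, "C": 102}, bound=4))
print(vote_variant({"A": 100, "B": 101, "C": 150}, bound=4))
```

The suspect returned here only names the source of the deviating value; as the text notes, the underlying fault could still lie in the node, the link, the software, or the external input.
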

3. Two Value Data Transfer

There is the possibility that only two values are made available to
the SIR system. The data must be cross linked and voted in this case also.
The condition can be caused in two separate ways. The first cause is a
characteristic of the external system that is used to supply the SIR
architecture with input data. A sensor failure in the external sensor set
that supplies the SIR input data would cause only two values to be presented
to the active processor set. A DSPM system link failure could also isolate a
sensor and prevent communication of its data to the SIR system.

Internal SIR fault conditions cause the second category where two
data values must be cross linked and voted. The removal of a SIR processor
by either a processor failure, or a combination of SIR link failures that
isolate a good processor, sets up a condition where the active processor set
consists of only two processors.

In the first case, an assumption is made that there is a full
complement of active SIR processors and the appropriate links. The
discovery of the external fault which causes the two valued data must be
globally distributed to the active SIR processor set. With this global
knowledge, the SIR processors are aware of which two processors will
receive the data. The data is cross linked in the same way as in the simplex
data case for each of the data values that is received by the SIR system. At
this point, each of the three active processors contains both of the
externally supplied values.

With only two values available, the concept of a midvalue vote is not
appropriate. The values are instead transferred to the host computers in each
of the nodes. The difference is taken between the two values and the
variation bound test is applied to this difference. If the difference between
the two values is within the variation bounds, then an average is taken in
each processor. These averages are then cross linked among the processors
by the three value communication algorithm. The averages that are cross
linked should be an exact match, because the node components are identical
from both a hardware and software view. (The software for system control
is not performed by use of N-version programming techniques.) If the two
data values generate an out of bounds variation when they are compared,
then reasonability tests are applied to the two values and the more
reasonable of the two is selected. The confidence is less than is achieved
with values that remain within the limits on deviation, but continued
system operation is maintained. The reasonability tests are based on past
values and the rates of change for the sensor type as well as implications
that can be drawn from combinations of data supplied from other types of
sensors. Some upper limit is necessary on the number of different sensors
that can be in this reduced sensor set condition, but the upper limit and the
level of confidence required will not be addressed in this paper.
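
The two value decision can be sketched in Python as follows. The reasonability test is represented by a deliberately crude stand-in, closeness to a value predicted from past data; that stand-in, the `predicted` parameter, and the function name `two_value_select` are assumptions made only for illustration.

```python
def two_value_select(v1, v2, bound, predicted):
    """Process a two valued data word in one host processor.

    Returns (value, within_bounds). When the two values agree to within
    the variation bound their average is used. Otherwise the value
    closer to `predicted` is selected, as a placeholder for the
    reasonability tests, and within_bounds is False to signal the
    reduced confidence.
    """
    if abs(v1 - v2) <= bound:
        return (v1 + v2) / 2, True
    chosen = v1 if abs(v1 - predicted) <= abs(v2 - predicted) else v2
    return chosen, False

print(two_value_select(10.0, 10.5, bound=1.0, predicted=10.2))
print(two_value_select(10.0, 20.0, bound=1.0, predicted=10.4))
```

Because every node runs the same computation on the same two inputs, the results are expected to match exactly when they are subsequently cross linked and voted.
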

If the two valued case arises due to a fault that is internal to the SIR
architecture, then a slight variation of the approach described above can be
used. In this case there are only two communicating processors; the faulty
processor has been powered down. The interstage registers that would have
been used to connect the faulty processor are in a cleared state so the vote
process still functions correctly in the remaining processors. (The vote
instruction from the host processor to the interstage is the only means of
generating an output from the slave status register to the host. This
restriction was designed into the interstage controller to simplify the
controller design.)

The two received values, one at each of the active processors, are
processed as discussed for the case of two data values and three good
processors. The only difference is that the minimum value is excluded from
consideration in the final cross link and vote of the averages.


B. SYSTEM STATES

There are $2^n$ system states in systems that contain n components and
where each component can be in either a correctly operational state or a
faulty state. For the three processor case the number of components is 6
when the links are included. This means that there are 64 system states for
the three processor SIR architecture. This set of system states includes all
combinations of 6 components where each component can be failed or good.

The SIR architecture is operationally dependent on a TMR system. This
condition results in the complete set of 64 system states not being useful.
Acceptable confidence in the correct system output in TMR systems can only
be achieved when at least two of the three processors agree on the value to
be output from the system. System states where there is more than one
faulty processor can no longer provide even minimum confidence in the
values being generated for output to the external system. Combinations of
processor faults and link faults that isolate two fault free processors are
also not useful; if the two good processors cannot communicate, then there
is no possibility of an explicit agreement on data values even if both
processors are producing identical values.
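
The usable states can be found by brute force enumeration. The Python sketch below assumes that a state is usable exactly when some pair of fault free processors is joined by a fault free link; the component labels and the function names are illustrative only.

```python
from itertools import product

PROCESSORS = ("A", "B", "C")
LINKS = (("A", "B"), ("A", "C"), ("B", "C"))

def is_good(proc_ok, link_ok):
    """True when two fault free processors share a fault free link."""
    return any(proc_ok[a] and proc_ok[b] and link_ok[(a, b)] for a, b in LINKS)

def all_states():
    """Yield every good/failed assignment of the 3 processors and 3 links."""
    for bits in product((True, False), repeat=6):
        yield dict(zip(PROCESSORS, bits[:3])), dict(zip(LINKS, bits[3:]))

states = list(all_states())
good_states = [state for state in states if is_good(*state)]
print(len(states), "states in total,", len(good_states), "meet the TMR condition")
```

Under this reading 19 of the 64 states are usable: 7 with all three processors good and at least one good link, and 12 with a single faulty processor and a good link between the other two. These counts follow from the sketch's definition and are not taken from the thesis figures.
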

The SIR system has another pertinent characteristic that affects the
number of system states that are useful in redundancy management. There
is no overall system controlling processor or component. Decisions on the
correctness of the values that are supplied by the remaining active
processors are made independently in each of the active processors.

There will be several system states that are identical in the number and
type of components that are faulty. The difference in these system states
is apparent only in the labeling of the faulty components. If there is no
overall system that controls the redundancy management, then the label
changes are meaningless to the reliability calculations. This is true
because the duplicated components used in the SIR architecture have an
equal impact on the system if they fail and the probability of components of
a like class failing is identical. A failure of processor A has no less and no
more effect than the failure of processor B, if there are no other component
failures in the system.

The calculation of the number of “good” system states in a TMR system,
where good implies that the minimum TMR confidence level is met, is
straightforward. The binomial coefficient notation is a compact way of
describing the number of ways a subset of objects can be selected from a
larger set. The notation for the binomial coefficient is shown in Figure 5-3.
The binomial coefficient calculates the number of ways that a subset of
objects can be selected from a larger global set of objects where the order
of selection is unimportant.


$$\binom{X}{Y} = \frac{X!}{Y!\,(X - Y)!}$$


Figure 5-3 Binomial Coefficient Notation





For the purposes of modeling TMR system states, the X term in the
binomial coefficient will represent the number of components in the
system. The Y term in the binomial coefficient will represent the number of
faulty components in the set of X components.

A further breakdown is necessary to correctly represent the system
state aggregations that will be developed. Two fault free communicating
processors is the condition required for continued operation of a TMR
system. There are two component types in the system (the nodes and the
communications links), so a complete representation of possible fault
combinations must take the two component nature of the system into
account. This is easily accommodated by the binomial coefficient notation
by simply treating each component class separately and multiplying the two
resulting values. The notation convention used will place the processor
term first in the processor/link multiplicative pair.
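
The processor/link product is easy to evaluate directly. The Python sketch below uses the standard library function `math.comb`; the helper name `class_size` is hypothetical.

```python
from math import comb

def class_size(faulty_processors, faulty_links, n_processors=3, n_links=3):
    """Number of label variants with the given fault counts.

    The processor term comes first and the link term second, following
    the multiplicative pair convention described above.
    """
    return comb(n_processors, faulty_processors) * comb(n_links, faulty_links)

print(class_size(0, 0))   # the single fault free state
print(class_size(1, 0))   # one faulty processor, all links good
print(class_size(1, 1))   # one faulty processor and one faulty link
```

Summing `class_size` over every pair of fault counts recovers all 64 system states. The product counts every label variant of a fault pattern, so whether a given variant still satisfies the TMR condition has to be checked separately.
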

Figure 5-4 shows the resulting equation for calculating the number of 
system states where continued TMR operation is possible, where the first 
factor of each term is the number of ways to choose fault-free processors 
and the second factor is the number of ways to choose fault-free links. 
These states are referred to as good states, although the confidence level
varies among the state groups that this set comprises.

A graphical representation of the fault classes that are sustainable by
the TMR system is shown in Figure 5-5. The figure shows components that
are faulty by using highlighted lines. Each fault class shows the
corresponding term from Figure 5-4 to indicate the number of system states
that can exist for that particular fault class (identical but for label
changes).

A larger aggregation of fault classes can be made after studying the
fault classes shown in Figure 5-5. The classes shown in Figures 5-5c
through 5-5f have an important characteristic in common; one of the
processors in the active set of processors is either faulty, or isolated from
the remaining active processors by faulty communications links, or a
combination of both. Any of these four fault classes will of necessity be
treated in the same fashion; the two remaining, communicating processors
will assume that the third processor is faulty and proceed accordingly.

There will be a number of death states in the TMR system. The death
states are composed of fault classes that do not allow the minimum TMR
confidence level to be met. The number of death states is calculated in the
same manner as for the number of good system states. A naive assumption
can be made that the number of death states is simply the number of states
remaining after subtracting the good states from the total number of states.
This is not the case. The definition of a death state is one in which there
are no further transitions possible. Implied in this statement is that there
is some starting state for the system, ideally a fault free state. The
assumption for only single fault arrivals in the SIR system has already been
justified in Chapter IV. The combination of these two characteristics
results in a reduced set of death states.


Figures 5-6 and 5-7 show the calculation of the number of death states 
possible in a TMR system and a graphical representation of the fault classes 
for system death states respectively. Each of the fault classes
depicted in Figures 5-6 and 5-7 corresponds to a fault class that is entered
by a single fault arrival.

The system states remaining after the death states and the good states
are subtracted from the total number of system states are classified as
impossible system states for a TMR system. The reasoning for this
classification is a result of the TMR characteristics together with the
single fault arrival characteristic of the SIR system. All of the impossible
state fault classes can only be achieved by way of a death state. By
definition, there are no transitions allowed leaving a death state. The
impossible states simply cannot be reached. (This assumes that the system
shuts itself down when a death state occurs.)
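
The three way partition into good, death, and impossible states can be reproduced by enumeration. The Python sketch below rests on two assumptions: a state is good when two fault free processors share a fault free link, and a death state is any state that is not good but can be entered from a good state by a single fault arrival. The tuple encoding and the function names are illustrative.

```python
from itertools import product

# A state is a tuple of six booleans (True = fault free):
# processors A, B, C followed by links A-B, A-C, B-C.
LINKS = ((0, 1), (0, 2), (1, 2))

def is_good(state):
    """True when two fault free processors share a fault free link."""
    return any(state[a] and state[b] and state[3 + i]
               for i, (a, b) in enumerate(LINKS))

def classify(state):
    """Return 'good', 'death', or 'impossible' for one system state."""
    if is_good(state):
        return "good"
    # A death state has at least one good predecessor, i.e. restoring
    # one of its failed components yields a good state.
    for i, ok in enumerate(state):
        if not ok:
            predecessor = state[:i] + (True,) + state[i + 1:]
            if is_good(predecessor):
                return "death"
    # Otherwise every path to this state passes through a death state.
    return "impossible"

counts = {"good": 0, "death": 0, "impossible": 0}
for state in product((True, False), repeat=6):
    counts[classify(state)] += 1
print(counts)
```

Under these definitions the enumeration gives 19 good, 31 death, and 14 impossible states. These totals are a product of the sketch's own definitions and should not be read as the values shown in Figures 5-4 through 5-9.
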

The enumeration of the impossible states for the SIR (TMR) system is
shown in Figures 5-8 and 5-9. These states cannot be reached in the SIR
system, so they will not be included in any configuration management
algorithms or in the semi-Markov reliability model that will be generated in
section D.

C. INTER-SIR COMMUNICATIONS AND
FAULT LOCATION PROTOCOLS


The cross communication of data values between the active set of SIR
nodes is necessarily dependent on a rigidly defined order of events and
reactions to the outcomes of those events. This rigid operating procedure,
called a communications protocol, is a straightforward method of
exchanging information reliably between the node processors. Section A of
this chapter described the method in which data is exchanged and voted in
order to assure a reliable, congruent data exchange.

The TMR operating environment must also be able to react correctly to a
fault arrival. Correct reaction to a fault means that the system must
reconfigure the system components such that the set of active core
components comprises, as closely as possible, a complete set of good
components. Correct reaction to a fault arrival requires that the fault not
only be detected, but also be located. The algorithms for cross linking data
values between the active processor set are adequate to detect a faulty
component; however, using these algorithms alone, the fault can only be
located to a possible set of components consisting of the node from which
the offending value arrived, the link over which it arrived, and the external
input system (for the case of system input data).

The knowledge of a fault occurrence must also be transmitted to all of the
active nodes. It is quite possible that the fault can be detected at only one
of the nodes in the active processor set. For example, this could happen by a
failure of the control signal that enables the shift out operation (shift
right) on the B or C registers in one of the system's interstages (see Figure
2-3). If the failure occurs in node A and the failure affects the path to node
B, then node B is the only node in the system that can detect the error. As
an example, consider the case where processor A operates correctly, other
than the inability to enable the shift right instruction of the register
connected to node B. Node C will not be able to detect the fault because the
path to node A is functioning correctly. Processor A will not be able to
detect the fault because the shift right operation of the offending shift
register does not affect either the node's voter or the ability to receive a
data word from node B correctly.

In a TMR system, at least two communicating processors must agree on
the result before confidence is placed in the result. This requirement also 
applies to fault detection. For the example above, if node B does not obtain 
corroboration of the detection of the fault in processor A (or the link 
between nodes A and B), then the consensus of the active processor set will 
be that node B is in error, even though node A is actually the faulty node. 

A system for the exchange of fault detection information must therefore be
included in the communications protocol. Because the occurrence of a fault in
the system is not the expected result (the system uses relatively reliable
components), there must be a preset, recognizable method that explicitly
states the intent to communicate a fault arrival rather than the next data
word to be communicated. The preset signals that are used to indicate
conditions within the SIR active processor set will be referred to as
Inter-SIR protocol command words, or just command words.

There are two methods that are generally used to differentiate the
command words and the data words. The first is to send the word in a
message format where there is a preset sequence of words expected; for
example, a command word followed by a data word. If enough information
needs to be transmitted with the data, then more than one command word
can be sent in the message, again with the words in a preset sequence. The
second method that can be used is to modify a frame synch bit, as in the
case of the MIL-STD-1553 protocol. When a synch bit is available for use in
this manner, there is more latitude available in the order of the words sent.
A reduction in the total transmitted volume of information can be achieved
in this manner. Each data word does not necessarily need a command word
preceding it; expected data traffic can proceed in an ordered manner, and if
errors are detected, then a command word can be inserted into the data
stream and be recognized as such by the difference in the synch bit.

The SIR architecture does not use synch bits in the transfer of data 
between its internal nodes. The former method must therefore be used so
that there is no possibility that data can be misinterpreted as command 
words and vice versa. 

The structure of the command words must be designed so that the
originator of a command word is included in the word. This knowledge is
needed in the activation of an offline spare to ensure that the active
processors are all aware of the actual configuration of active nodes. The
set of command words must also be large enough to indicate all of the
possible faults that can be generated by a cross link process. The indication
of a fault can be communicated among processors by sending the status of a
vote. In this case the command word structure must be robust enough to
include all possibilities of results that can be generated. The indication of
an out of bounds condition is not sufficient; the processor that deviated
from the allowable bound must be indicated in the command word.

The detailed design of the command word structure will not be addressed
in this paper. The high level protocols developed will assume an appropriate
command word structure for the task. The discussion above is included for
completeness of the discussion on the communications protocols.
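
Purely as an illustration of the stated requirements (originator included, vote status carried, deviating processor named, fixed word sequence), a command word could be sketched in Python as follows. The bit layout, the status codes, and every name in the block are hypothetical; no command word format is defined in this paper.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional

class Status(IntEnum):          # hypothetical vote status codes
    AGREE = 0
    OUT_OF_BOUNDS = 1
    EXACT_MISMATCH = 2

NODE_IDS = {"A": 0, "B": 1, "C": 2}
NODE_NAMES = {number: name for name, number in NODE_IDS.items()}
NO_NODE = 3                     # code used when no processor is suspected

@dataclass(frozen=True)
class CommandWord:
    originator: str             # node that issued the command word
    status: Status              # outcome of the originator's vote
    suspect: Optional[str] = None   # processor that deviated, if any

    def pack(self) -> int:
        """Assumed layout: bits 0-1 originator, 2-3 status, 4-5 suspect."""
        suspect = NO_NODE if self.suspect is None else NODE_IDS[self.suspect]
        return NODE_IDS[self.originator] | (int(self.status) << 2) | (suspect << 4)

    @classmethod
    def unpack(cls, word: int) -> "CommandWord":
        suspect = (word >> 4) & 0b11
        return cls(NODE_NAMES[word & 0b11],
                   Status((word >> 2) & 0b11),
                   None if suspect == NO_NODE else NODE_NAMES[suspect])

def parse_stream(words):
    """Split a fixed sequence stream: command word, data word, repeated."""
    if len(words) % 2:
        raise ValueError("stream must contain command/data word pairs")
    return [(CommandWord.unpack(c), d) for c, d in zip(words[0::2], words[1::2])]

report = CommandWord("B", Status.OUT_OF_BOUNDS, suspect="A")
print(parse_stream([report.pack(), 1234]))
```

Because the position in the sequence alone decides whether a word is a command or data, no synch bit is needed, which matches the first of the two methods described above.
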

The node processor acts as the controlling element of a fully
controllable n x 2 full duplex switch, where the switch is actually the
node's rotary multiplexer (n is the number of remaining processors in the
given SIR architecture). Control of the paths selected to connect the rotary
multiplexer input ports with the multiplexer output ports is established by
the control word supplied to the rotary multiplexer by the node's host
processor. The process of changing the routing of the paths through the
multiplexer also changes the composition of link subcomponents that
comprises a particular system link. The single fault arrival assumption
assures that if the component that has actually failed is the link, then only
one terminal end of the link has failed. Although the subcomponent
compositions of the links are now different, each new path is an
independent circuit once the control word has been loaded into the set of
rotary multiplexer flip flops (as discussed in Chapter II B).

The notation shown in Figure 5-10 will make the following discussion
easier to follow. Each of the nodes contains 4 parts of a communications
path: two are incoming simplex paths and two are outgoing simplex paths.
These paths are indicated as 1 and 2 in Figure 5-10, with the direction
indicated by the subscript i for input and o for output. The paths will
always be manipulated as pairs. For example, a node's number 2 path will be
directed to the same external node for both the input and output subscripts.


Figure 5-10 Basic SIR System





Using the notation introduced by Figure 5-10, a link is composed of 2
pairs of components. For example, the link between nodes A and B in the
figure is represented by the component pair B1A2 and the link between
nodes A and C is represented by the component pair A1C2.

The communication requirement between processor pairs consists of a
communications path in both directions. Each path has hardware
components located at the terminal ends of the path; however, all of the link
components must be operational for the link between the connected nodes to
be classified as good. No inconsistency with the concept of a link
component being the combination of the hardware at both ends of the link is
introduced by the additional information shown in Figure 5-10.

Because the rotary multiplexer is fully controllable, each of the selected
paths through the rotary multiplexer can be associated with either of the
interstage communications registers. The B interstage register pair
(consisting of registers B and B' in Figure 2-3) of node A can be routed to
either node B or node C.

The ability to switch the interstage register pair associated with a link
path allows isolation of the fault in the communications path (consisting of
the interstage and the link) to either the interstage or the link itself,
thereby providing fault location. If the fault is caused by the inability of
node A's interstage register (B) to shift right, as in the example, then
changing the external node that is connected with the B register to the
other active node will provide that second node with an indication of a
communications fault. Now two of the three nodes have received an
indication of a communications fault associated with node A, and the fault is
isolated to the interstage of node A.

The approach of switching the external nodes connected to a particular
interstage register pair is a sufficient location test for node faults that
affect only the interface with the rotary multiplexer. The same method also
correctly identifies a node that fails to correctly input communicated data
into its primed interstage registers.

An example is the best way to show that the path switching method can
uniquely locate a fault under the single fault arrival assumption. The
results of a cross link of a data word can manifest themselves in three basic
ways. The first case is for two nodes to indicate the same fault. Under a
single fault arrival assumption, this leads to the unique conclusion that the
indicated node is bad. A second case is for all three nodes to agree on the
vote outcome, in which case the voting process is assumed correct. If the
vote outcome is that all data values are correct, then all system
components are considered fault free and the system proceeds. If the vote
outcome indicates a particular word is bad (not equal or out of range) then
two cases can apply. If the word is internally generated (a command word
or a data word produced by the flight law calculations), then the node that
produced the erring word must be bad. If the word is produced by a source
external to the SIR architecture (DSPM) then a preset test word must be
used in a cross link test. If each voter status agrees on the outcome of this
test (using invariant data transfer) then the fault is external to the SIR and
the DSPM redundancy management routines are notified of the fault.

The third case applies if the results of the vote test do not agree (for the 
preset word test or for the original vote outcome). A series of additional 
tests must be performed to isolate the fault. If two nodes indicate the 
same node as faulty, then by the single fault arrival assumption, the 
indicated node must be faulty. If only one node has indicated a fault then 
more tests must be implemented. (If the node indicates that both the other
nodes are faulty, then the single fault arrival assumption implies that the 
node which indicates the faults is bad.) 
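
The decision rules above can be summarized in a short sketch. The function
name, the dictionary representation of the voter reports, and the verdict
strings are assumptions introduced for illustration; the sketch covers only
the node-level rules stated in the text and does not model the preset word
test for externally produced data.

```python
from collections import Counter


def locate_fault(reports):
    """Apply the single fault arrival rules to one round of voter reports.

    reports maps each node name to the set of other nodes it flags as faulty.
    Returns (verdict, node); verdict is "no_fault", "faulty", or "more_tests".
    """
    accusations = Counter(n for flagged in reports.values() for n in flagged)
    # Two nodes indicating the same node: the indicated node is faulty.
    for node, count in accusations.items():
        if count >= 2:
            return ("faulty", node)
    # One node indicating both of the others: the accuser itself is faulty.
    for accuser, flagged in reports.items():
        if len(flagged) >= 2:
            return ("faulty", accuser)
    if not accusations:
        return ("no_fault", None)
    # Only one node has indicated a single fault: further tests are needed.
    return ("more_tests", None)


verdict = locate_fault({"A": {"C"}, "B": {"C"}, "C": set()})
```

The "more_tests" verdict corresponds to the situation handled by the path
switching tests described next.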

Figure 5-11 depicts the possible test outcomes for a set of three tests
(a, b, and c). Test outcome (a) in Figure 5-11 shows a possible scenario
where only one node indicates a fault. (This test outcome is generated by
the data cross link that supplied the initial fault detection.) The rows in
the figure indicate the test outcomes of each of the nodes with the columns
indicating the nodes on which the test was performed. A good test outcome
is represented by an O and a faulty test outcome is represented by an x. The
possible faults that could cause these outcomes are also indicated in the
figure. There are several possible faults that could generate the test
outcome shown in Figure 5-11a. A second test is needed to isolate the
actual fault condition that caused the test outcome shown in Figure 5-11a.
A logical second test switches the paths through node C's rotary multiplexer
and repeats the data cross link which initially generated the test outcome
shown in Figure 5-11a. The possible outcomes of the second test are shown
as test outcomes b.1 through b.3 in Figure 5-11. (The single fault arrival
assumption limits the possible outcomes.) Unique fault locations are
provided for each of the test outcomes except for test outcome b.2 in Figure
5-11. In this case the fault can't be isolated with just this test. A third
test is needed. The logical approach is to exchange the paths through the
rotary multiplexer of node A and cross link and vote again. The possible
outcomes of this new test are indicated in test outcomes c.1 through c.2 in
Figure 5-11. The outcomes of this third test are unambiguous under a
single fault arrival assumption so no further tests are necessary. The fault
has been isolated. The set of faults listed as the possible causes of the test
outcome shown in Figure 5-11c.1 is a simplification. The test outcome
isolates the fault to the outgoing end of a link that resides at processor A.
The failure could also be in the interstage of processor A, however. If that
is in fact the case, then node A is actually faulty and not the link. The test
isolates the fault to that single outbound path from node A, however, and
verifies that the remaining path out of node A is functioning correctly. For
the three processor case, the fault has been isolated to a sufficient degree
and no other test is necessary. For the case of the SIR architecture with
spares, an indication should be registered that either the outbound path
through the interstage in node A is faulty or that the outbound link from
node A is faulty. In either case the node should be replaced with a
completely fault free spare (with requisite links). The node is not, however,
eliminated from consideration for possible use in a future active node
configuration.

The two cases of link failure (the input half of a link or the output half
of a link) can actually be thought of as a single case because the link must
be full duplex for correct operation. Failure of both simplex paths associated
with a single duplex link is not consistent with the single fault arrival
assumption that was validated in Chapter IV, unless the fault lies in a
node's hardware or software and not the node's link hardware. This will
happen in two separate cases: the node processor supplies an incorrectly
ordered set of bits to the set of flip flops in the rotary multiplexer; or the
AND gate that controls the loading of new values into the rotary multiplexer
flip flops fails. In either case, the nodes that are connected
by the rotary multiplexer have no opportunity to agree on data values. The
ideal TMR operation is no longer possible and an alternative is required.

While ideal TMR operation is no longer possible in the nodes that are
connected by the faulty link, the remaining active processor can still
perform an ideal TMR operation. If all three processors remain in service,
then the node not directly affected by the faulty link has become a single
point of failure for the system. It must relay data words between the two
processors that have the faulty link in common. Neither of the remaining
links is a single point of error for the system, nor are the nodes that have
the failed link in common. A single failure of either of the remaining good
links results in the isolation of a processor, but there are still two
communicating processors in the system and, as discussed in section A.2,
operation can continue.

Figure 5-12 shows a flowchart of the communications and fault location
protocol. The operational modes for cross linking data under varying fault
classes, discussed in section A of this chapter, are referred to in the
flowchart.

D. RELIABILITY MODEL

All of the necessary information has now been developed for applying the
semi-Markov model to the three processor case of the SIR architecture. The
model will be shown graphically as discussed in Chapter IV, and is shown in
Figure 5-13.

Figure 5-13 (a) Reliability Model for the
Three Processor SIR Architecture (complete)

Figure 5-13 (b) Reliability Model for the
Three Processor SIR Architecture


Note that there are two separate λs that are used to indicate the arrival
of a fault. (Recall that the arrival of a fault corresponds to a state
transition in the semi-Markov model.) λ_p will denote the arrival of a
processor fault, while λ_l will denote the arrival of a link fault.


There are no system recovery transitions possible for the three 
processor case of the SIR architecture. This is because there are no spares 
associated with the system. The system is tolerant of faults, with the 
number of faults that can be tolerated being a function of the order in which
the faults arrive. Therefore, use of the notation described in Chapter IV
corresponds to a linear state transition graph.

The notation that is used in the reliability model refers to the fault
classes that were described in section B of this chapter. For example, the
fault free category is notationally listed as state 5.5(a). The notation
refers to Figure 5-5 (a) to describe the fault classification that corresponds
to the state.

A good deal of state aggregation is used in the reliability model shown in 
Figure 5-13. The death state categories that are shown in Figure 5-7 are 
unique classifications of faults, however for the purposes of the model, the 
classification of subcategories of a death state is not important.
Therefore, the figure shows a single death state. The path that is followed
in the figure to arrive at the death state identifies the fault class of the 
death state. 

Figure 5-13a shows the complete set of good states for the three 
processor case and the possible transitions due to fault arrivals. The states 
that are shown enclosed within the dotted box are effectively identical with 
respect to system states that follow and with respect to the 
communications and operational algorithms that are used in each of the
states. A super state aggregation can therefore be made and no reliability
information will be lost.

The transitions that are made inside the dotted box are included only to
show a complete exhaustion of system components. The algorithm that
detects and locates a fault would not actually allow the transitions internal
to the dotted box to take place. This is because when a processor is
determined to be faulty the fault is considered to be a permanent fault and
the processor is powered down. Because the fault decision is made by the
two nodes that aren't faulty (or isolated) the actual condition of the node
that is voted as faulty is not relevant. The single fault arrival assumption
assures that the correct node is powered down because both of the nodes
that decide the fault cannot be faulty. The states shown as (c), (d), (e), and
(f) in Figure 5-5 are therefore actually one distinct state as viewed by the
redundancy management algorithm. This is shown in Figure 5-13b where the
state shown as 5.5(c) represents all of the states within the dotted box in
Figure 5-13a.

Rather than show a separate transition for the link and node fault
arrivals, a more compact notation is used that loses none of the actual
reliability information. If two transitions leaving a particular state both
arrive at the same next state, then the probability of transitioning between
the two states is just the sum of the two transition probabilities. The
collapsed notation helps keep the model diagram from becoming cluttered
unnecessarily.
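
The collapsing of parallel transitions can be sketched as follows. The state
names and the numerical rates are placeholders chosen for illustration only;
they are not values from the reliability model.

```python
from collections import defaultdict


def collapse_parallel(transitions):
    """Merge transitions that share (source, target) by summing their weights."""
    merged = defaultdict(float)
    for source, target, weight in transitions:
        merged[(source, target)] += weight
    return dict(merged)


# Hypothetical processor and link fault arrival rates (illustrative only).
lam_p, lam_l = 1e-4, 1e-5
edges = [
    ("S0", "S1", lam_p),     # processor fault arrival
    ("S0", "S1", lam_l),     # link fault arrival leading to the same state
    ("S0", "DEATH", 2e-6),   # unrelated transition, left untouched
]
collapsed = collapse_parallel(edges)
```

After collapsing, the single S0 to S1 edge carries the sum of the two
arrival rates, which is the compact labeling used in the diagram.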
VI. THE SIR ARCHITECTURE WITH SPARES 


The SIR architecture is obviously designed with the operation of spares
in mind. The addition of the rotary multiplexer is made for just this
purpose. The development of the three processor case was necessary
because in all of the cases of the SIR architecture that utilize spares, the
active set of processors remains a three processor core. The method in
which the system detects faults is identical whether the spares are
included or not. The introduction of faults to the system over time
eventually leads to some mode of the three processor case. This will happen
when processors fail and are replaced in the active core by spares. A failed
processor is now placed in a spare position; however, there is no possibility
of restoring the failed processor to active status. The set of available
spares is therefore reduced by one processor. Continuation of this process
leads to the depletion of the spares and thus effectively results in a mode
of the three processor case. In a like manner, the failure of links can
isolate processors. The results are the same: effective reduction to some
mode of the three processor case.

The task of the operating protocol is a little different in the SIR system
with spares. The location process is even more meaningful in this case,
because a history must be kept of all the faults that have occurred in the
system to date. The loss of a single link may cause the swapout of a
processor. However, the processor that is placed in spare status is still
functional, and may be used in a different configuration of the active core
so the core is always composed of fully functional components. Of course
this depends on the number of spares in the system (with the requisite
number of links for full interconnection) and the fault history. With this
possibility in mind, the algorithm that determines how the spares are to be
managed is key to the reliability model of the system as a whole.

The amount of system state information and the complexity of the
system grow at an exponential rate, as can be seen by the 2^n figure for the
number of system states. The number of processors grows in a linear
fashion with the addition of spares to the system. The number of links
grows at a greater rate, however. Because the system is composed of fully
connected processors, the addition of a single processor causes the addition
of a number of links equal to the number of processors in the system before
the addition of the new processor. Addition of the first spare causes the
addition of three new links to the system. The second added spare causes
the addition of four new links. All the while, the complexity of the system
in terms of system states is growing at 2^n.
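
The growth described above can be checked with a few lines of arithmetic.
The sketch assumes that n in the 2^n figure counts every processor and every
link as one two-state (good or failed) component; the function names are
introduced here for illustration.

```python
def link_count(processors):
    """Number of links in a fully connected system of this many processors."""
    return processors * (processors - 1) // 2


def links_added_by_next_processor(processors):
    """A new processor needs one link to every processor already present."""
    return processors


def raw_state_count(processors):
    """2^n states, with n taken as processors plus links (an assumption)."""
    return 2 ** (processors + link_count(processors))


for p in (3, 4, 5):
    print(p, link_count(p), raw_state_count(p))
```

Under this assumption the four processor system has 10 components and
therefore 1024 raw states before any aggregation.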

There are aggregations of system states that can be made in the four 
processor case similar to the way in which state aggregations were made 
for the three processor case. For the same reasons as given in Chapter V, 
there will be a number of impossible system states that can be immediately 
eliminated from consideration in the reliability model as well as the 
operating protocol.

The death states in the SIR system with spares will be a larger set of
fault classes for several reasons. The obvious reason is that there are more
components in the system. Secondly, the system will always be in a death
state when two of the active core of processors fail, even when there is a
full complement of good spares. This condition occurs whenever the fault
arrivals in the active core of components happen faster than the recovery
process. This possibility of system failure prior to the exhaustion of the
spare set emphasizes the need to make the recovery process as quick as
possible. When a link or a processor fails, the processors remaining in the
active core must replace a processor by an appropriate spare in order to
return the active core to a fully operational component set.

The operation includes selection of the appropriate spare and testing of
the spare and its link with the two processors selected to remain active. (If
the fault was a link failure, then there are two choices for which active
processor will be placed in a spare status.) A process of updating the new
addition to the set of active processors must be performed when an
acceptable node is found. The DSPM state table must be transferred to the
new node as well as the fault history in the SIR system. State information
for the application software must also be transferred, particularly past
data input history that is used in calculating the reasonableness of a
suspect new data word. All of this information is necessary and its
communication will consume time. Of course the data transfer must be of
the invariant type for transfer of system state information.

The TMR operation of the SIR system and the single fault arrival
assumption assure that there will not be more than one fault in the
active set of processors unless there is no possibility to transition to a
configuration (through the use of components in the set of spares) that
contains fewer faults than the present active core of components. When a
fault arrives in the active core a search is undertaken to locate a fault free
spare node such that its links with respect to the two communicating nodes
from the active set are also fault free. Which of the spares is selected (if
any) is dependent on the location of the detected fault in the active set of
components as well as the state of both the spare nodes and the spare links
(with respect to the active set of nodes).

A question arises as to the number of states that are needed to
completely specify all possible system states that can exist under the SIR
operating environment. A search for a replacement node for the active set
is not undertaken until a fault arrives in the active set of components and
is detected and located. Recall that the system nodes are fully
interconnected but that the search for a configuration that will improve
that of the active core (with respect to the number of faulty components) is
performed only by the nodes currently in the active set. The links that are
terminated only on spare nodes can be discounted because the state of these
links is not observable by the nodes in the active set which are making the
decisions. This means that there is a set of three links which connect each
spare node with the active set of nodes (see Figure 6-1). The number of
possible states that can exist among the spare links (discounting links
connecting spares) is thus 2^{3n}, where n is the number of spares in the
system. Because the single fault arrival assumption is not valid for the
spare components (the test interval for the spare components could be much
longer than the test interval for active components) any of these states are
possible when a fault arrives within the active core. The fault that arrives in
the active core will be detected and located; however, because the
probability of a fault occurring is equal among like components (nodes or
links), all possible positions of that fault in the active core must be
accounted for in order to assure that all possible paths through the
semi-Markov model are considered. The number of states is then three
times the number of possible states within the set of spares, or 3 x 2^{3n},
which reduces to 3(8)^n.

Figure 6-1 SIR Architecture with Spares

Suppose however that only the number of link faults for each of the
spares is known (with respect to the nodes in the active core). Note that
three link faults between a particular spare and the nodes in the active core
will isolate the spare and is thus equivalent to a failed spare (with respect
to the current configuration of the active core). Four states are therefore
required to indicate the information concerning each set (which consists of
a spare and the links that terminate on both the spare and one of the nodes
in the active core). Because there are n such sets, the total number of
states required is (4)^n. The total number of states required to describe the
system is then (4)^n times the three possible positions of the fault that has
just arrived in the active core, or 3(4)^n.

Suppose that the following conjecture is true. All nonredundant state
information is retained in the latter system state description. An example
for which this is true is shown in Figure 6-2.

Assume a link failure has occurred within the active core and that there
are 2 faulty links between spare S1 and the nodes in the active core. There
are 3 possible ways that the 2 failed spare links could be configured with
respect to the failed link in the active core. Two of the ways are
topologically equivalent. This case arises when L2 or L3 is the failed link in
the active core and LS1 and LS2 are the failed links associated with S1.
These are equivalent topologies because in either case two of the active
nodes each have 1 failed link with respect to the remaining active core and
the spare S1 while the remaining node in the active core has 2 failed links
associated with it. In the remaining case (with L1, LS1, and LS2 faulty) 2
active nodes have 2 associated faulty links. The same number of possible
system topologies would have resulted by fixing the active link failure and
varying the 2 failed spare links in all possible combinations.
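
The counting argument above can be reproduced with a small sketch. Figure
6-2 is not reproduced here, so the assignment of link labels to node pairs
below is an assumption chosen to agree with the description in the text
(LS1 and LS2 terminate on the two nodes joined by L1). The per-node count of
failed links is used as a coarse signature of the topology; it is not a
general graph isomorphism test.

```python
from collections import Counter

# Assumed labeling: active nodes P1..P3 and spare S1.
LINKS = {
    "L1": ("P1", "P2"), "L2": ("P2", "P3"), "L3": ("P1", "P3"),
    "LS1": ("S1", "P1"), "LS2": ("S1", "P2"), "LS3": ("S1", "P3"),
}
ACTIVE = ("P1", "P2", "P3")


def failed_link_profile(failed):
    """Sorted number of failed links touching each active node."""
    counts = Counter()
    for name in failed:
        for end in LINKS[name]:
            if end in ACTIVE:
                counts[end] += 1
    return tuple(sorted(counts[node] for node in ACTIVE))


# Fix the failed spare links and vary the failed active link.
profiles = {
    link: failed_link_profile({link, "LS1", "LS2"})
    for link in ("L1", "L2", "L3")
}
```

With this labeling, the L2 and L3 cases share one profile and the L1 case
has a different one, matching the two equivalent topologies and the one
remaining topology described above.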

If the state is represented by the unique number of failed links
associated with the spares and the occurrence of a link failure in the active
core, then all of the state information is still contained in the model only if
three transitions are shown for the state recovery. Effectively, some of the
state information is represented in the transitions leaving a state. The
positional dependence of the failure in the active core actually represents
three states. The specification of which state is actually present in the
three state aggregation is contained in the recovery transition that is
selected in leaving the three state aggregation. These transitions must
correspond to fixing the failed spare links and varying the position of the
active link failure. In the case shown above, two of these transitions are
equivalent so the probability of transitioning to the recovery state
indicated by these equivalent topologies is 2 times as great as the
probability for the transition indicated by the remaining topology (the
active link failure is L2).

An assumption will be made in this paper that the conjecture discussed
above is, in general, valid. The number of states needed in the models is
thereby reduced from 3(8)^n to 3(4)^n, which is significant for cases of n as
small as 2 (the five processor case).
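
The size of the reduction is easy to tabulate. The function names below are
introduced for illustration; the formulas are the two state counts given in
the text, with n the number of spares.

```python
def states_full(n_spares):
    """3 fault positions times 2^(3n) spare-link states, i.e. 3 * 8^n."""
    return 3 * 2 ** (3 * n_spares)


def states_reduced(n_spares):
    """3 fault positions times 4 link-fault counts per spare, i.e. 3 * 4^n."""
    return 3 * 4 ** n_spares


for n in (1, 2, 3):
    print(n, states_full(n), states_reduced(n))
```

For n = 2 the count drops from 192 to 48, a factor of four; in general the
ratio between the two descriptions is 2^n.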


A. THE FOUR PROCESSOR CASE 

The four processor case of the SIR architecture consists of the three 
processor active core and a spare processor (and the links that connect the 
processors). Following the procedure that was developed in Chapter V, the
set of system states can be broken down into a set of good system states, a
set of death states, and a set of impossible states.

Figure 6-3 shows the set of possible states in which TMR operation is 
possible within the active processor core. The notation used in the figure is 
slightly modified, so that "S" represents a spare processor.

The labels on the links were not added to this figure. No real information
is provided by the labeling of the links because each of the fault classes
represents all combinations possible by label changing of the particular
faults composing the fault class.

Figure 6-3 Good system states for
Four Processor Case

Several aggregations of the fault classes that are labeled good are made
in the figure. Figure 6-3 (d) represents a class of "good" system states that
will be assumed to be impossible states in the model for the four processor
SIR system. The impetus for this decision is based on the assumption of
single fault arrivals in the components that comprise the active core. A
system reconfiguration is always implemented, if possible, when a single
fault is detected in the active core. The length of time that will be allowed
to perform the reconfiguration will be limited to the maximum bounds on
the control cycle. Chapter IV validated a single fault arrival assumption for
the active component set; therefore, under the single fault arrival
assumption the set of fault classes shown in Figure 6-3 (d) is impossible.
Each of the fault classes has more than one fault in the active core and a
reconfiguration is possible to a core with a larger number of good
components.

The fault classes that are shown in Figure 6-3 (l.), (n.), and (o.) are each
aggregate classes that effectively equate to the fault classes for the three
processor case shown in Figure 5-5. Each of the aggregates shown in (l.),
(n.), and (o.) is actually larger than indicated in the figure. For reasons of
space in the figure, the three cases of a failed spare and the links
connecting it to the active core were not shown (one, two, or three failed
links). The aggregations do not change for these cases because a failed node
causes an effective failure in the links associated with it.

The death states and the remaining impossible states for the four
processor system do not need to be shown for cases other than the three
processor case. The TMR requirement for two active processors is
applicable only to the active core; any fault that takes the active core to
fewer than two active processors that are able to communicate with each
other is a death state. All combinations of spare processors and links in
their good or failed states can be aggregated into this overall death state.
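
The death state condition can be written as a simple predicate. The
representation of links as unordered pairs of node names is an assumption
made for illustration. Checking only direct links is sufficient here,
because any relay path through a third good core node implies that the relay
node itself has a good direct link.

```python
from itertools import combinations


def is_death_state(good_nodes, good_links):
    """True if no two working core processors share a working link.

    good_nodes: set of working processors in the active core.
    good_links: set of frozensets, each holding the two ends of a working link.
    """
    for a, b in combinations(sorted(good_nodes), 2):
        if frozenset((a, b)) in good_links:
            return False
    return True


core = {"A", "B", "C"}
all_links = {frozenset(pair) for pair in combinations(sorted(core), 2)}
```

The state of the spares does not enter the predicate, which reflects the
aggregation of all spare conditions into the single death state.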

The number of states that must be managed for the four processor case 
has been reduced from 2^{10} to just 15 (including the death state). The task
remaining is to develop an algorithm that effectively manages these states 
and sets up appropriate transitions between them. Given the algorithm and 
the state aggregations shown in Figure 6-3, a semi-Markov model can be 
constructed. 

1. Recovery Algorithm 

The algorithms that were discussed in Chapter V for detection of 
faults in the active core apply, with minor changes, to the case of the SIR
architecture with spares. The major difference is that once a fault has been
detected and located, a search for a spare is undertaken. The goal of the
search is to find a spare that, when activated, brings the complement of
good components in the core back to 6 (3 processors and 3 links). Short of 
this, a reconfiguration is desired that will connect three good processors 
with 2 good links. The final course of action is to use just 2 good 
processors and a single link connecting them, and power down the remaining 
processors in the system. 
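
The order of preference among the three courses of action can be sketched as
a search over candidate cores. This is a simplified illustration, not the
recovery algorithm itself: it examines every group of good processors rather
than holding two currently active nodes fixed, and the mode labels and data
representation are assumptions.

```python
from itertools import combinations


def best_core(good_nodes, good_links):
    """Choose a core using the preference order described in the text.

    good_links is a set of frozensets holding the two ends of each good link.
    Returns (mode, nodes); mode is "3P/3L", "3P/2L", "2P/1L", or "dead".
    """
    def links_within(group):
        return sum(frozenset(pair) in good_links
                   for pair in combinations(group, 2))

    triples = list(combinations(sorted(good_nodes), 3))
    # First choice: 3 processors with 3 links; second: 3 processors, 2 links.
    for wanted, mode in ((3, "3P/3L"), (2, "3P/2L")):
        for triple in triples:
            if links_within(triple) == wanted:
                return (mode, triple)
    # Last resort: 2 processors joined by a single good link.
    for pair in combinations(sorted(good_nodes), 2):
        if frozenset(pair) in good_links:
            return ("2P/1L", pair)
    return ("dead", ())


nodes = {"P1", "P2", "P3", "S"}
full = {frozenset(pair) for pair in combinations(sorted(nodes), 2)}
```

For example, with P3 failed and all links good, best_core({"P1", "P2", "S"},
full) selects the full core formed by the two surviving processors and the
spare.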

For the case of one spare, there are not many possibilities to be 
checked in the search. If the detection algorithm indicates that a processor
has failed then the spare is activated, and the failed processor is
deactivated.

There are several cases that can occur when the spare is activated.
The spare components are in a deactivated state and it is not possible to
test these components using the algorithms described in Chapter V. The set
of spare components is therefore subject to fault arrival rates greater than
for the case of the active core. The spare can be in a failed state, or either
of the links, or both, connecting the spare with the remaining good
processors can be failed. If both links are failed, the spare will behave as if
it is failed. If the spare is failed, or both links associated with the spare
are failed, then the two processors in the active core will deactivate the
spare and operate in a 2 processor mode.

If one of the links between the spare and the active processors is 
failed, then a 3 processor/2 link mode of operation is set up. Note that the 
initial load of system state values to the spare must be validated as 
correct. To validate the state values, the spare relays the data back to the 
active processor with which the spare has a good link. At the same time the
node that has 2 good links (the "center node") sends the state information to
the remaining active processor. In this way, the center node performs a
check on the state data contained in all three processors. Note that the 
center processor has become a single point of failure for the system. 

A similar process is performed for the case of a link failure detection 
in the active core. In this case, a decision must be made as to which of the 
two nodes coincident with the failed link is to be deactivated. If there is a 
link failure in one of the links that connects to the spare, then the choice is 
critical. The wrong choice leads to selection of a configuration that is not 
optimum for the set of nonfailed components. A way out of this dilemma is 
to not deactivate a node. Instead, one of the nodes that is coincident with 
the failed link can be placed in a wait state by setting that node's watch dog 
timer appropriately. The two remaining nodes attempt to establish 
communication with the spare in the manner indicated above. If a complete 
set of components for the core is established, then the active node in the 
wait state is deactivated and held in reserve (only one link is faulty, so there 
is still a chance that the node may have a good link with the newly activated 
spare). If a complete set of core components is not achieved, then the 
process is performed again with the roles of the nodes that are incident 
with the failed active link reversed. Of course, this algorithm requires 
establishing communication and performing a synchronization with the node 
that is in the wait state. 
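
The wait-state procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the `link` helper, and the function `recover_from_active_link_fault` are hypothetical, fault status is assumed to be known exactly, and the watch dog timer, state transfer, and synchronization steps are not modeled.

```python
from itertools import combinations

def link(a, b):
    """Undirected link between two nodes, as an order-independent key."""
    return frozenset((a, b))

def core_is_complete(nodes, good_nodes, good_links):
    """True if every node is good and every pair of nodes shares a good link."""
    return (all(n in good_nodes for n in nodes)
            and all(link(a, b) in good_links for a, b in combinations(nodes, 2)))

def recover_from_active_link_fault(active, spare, failed_link, good_nodes, good_links):
    """Try each node incident with the failed link in the wait state, in turn.

    Returns (new_core, reserve_node), or (None, None) when neither choice
    yields a complete core (hypothetical interface, for illustration only).
    """
    for waiting in sorted(failed_link):
        candidate = [n for n in active if n != waiting] + [spare]
        if core_is_complete(candidate, good_nodes, good_links):
            return candidate, waiting      # waiting node is held in reserve
    return None, None

# Example: the active link P1-P2 has failed and the spare link P2-S is also bad.
nodes = {"P1", "P2", "P3", "S"}
all_links = {link(a, b) for a, b in combinations(sorted(nodes), 2)}
bad = {link("P1", "P2"), link("P2", "S")}
core, reserve = recover_from_active_link_fault(
    ["P1", "P2", "P3"], "S", link("P1", "P2"), nodes, all_links - bad)
print(core, reserve)   # ['P1', 'P3', 'S'] P2
```

In the example, placing P1 in the wait state first fails because P2 has no good link to the spare; reversing the roles yields a complete core, which is the situation the text describes as the critical choice.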

The history of fault occurrences and where they are located is 
maintained in each processor. This enables past knowledge to assist in 
choosing the best strategy on the occurrence of a new fault in the active 
core. The algorithm described allows recognition of all of the states that 
are shown in Figure 6-3 (except for (d) which is an impossible state). 

2. Reliability Model 

A graphical display of the reliability model for the four processor 
case of the SIR architecture is shown in Figure 6-4. The model shows all of 
the possible transitions between the state aggregations that are shown in 
Figure 6-3. The notation used in Figure 6-4 is modified slightly from that 
used in Chapter V. Because of the complexity introduced in the system by 
the addition of the spare node and the associated links, it was not 
convenient to follow the horizontal and vertical paths to represent fault 
arrivals and recoveries, respectively. Each of the transitions in Figure 6-4 
is instead labeled with λ's and κ's. The λ's represent fault arrivals and 
are further classified to indicate which type of fault has occurred. The 
subscripts shown indicate a fault as being an active processor (P), an active 
link (L), an inactive link to the spare (SL), or the spare processor (S). The 
recovery times are assumed to be identical for each of the fault classes. 
The labeling of the states in Figure 6-4 refers to the fault classes shown in 
Figure 6-3. 

The (f) fault class shown in Figure 6-3 was separated into two 
distinct fault classes in Figure 6-4. If the link fault in the active core is 
the link between P1 and P2, a recovery can be made to state (i). This 
recovery restores a full complement of good components to the active core. 
The same recovery is possible if the faulty link in the active core is 
between P1 and P3. If the faulty link is the one between P2 and P3, 
however, the only configurations possible are composed of an 
active core with three good processors and two good links, and no 
improvement can be achieved. This case is labeled (f2) in Figure 6-4 and the 
former two cases are labeled (f1). All recovery transitions from a single 
state are assumed to have equal probability. The probability of a particular 
recovery transition is indicated notationally as κ_n, where n equals the 
inverse of the probability of that transition (and the number of transitions 
that leave that particular state). For example, in state (k) of Figure 6-4, 
there are three recovery transitions possible depending on the position of 
the failed processor in the active core. Two of these three transitions go to 
the same state and are represented as 2κ_3. The remaining transition has a 
probability of 1/3 and is represented as κ_3. 
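
The equal-probability rule for recovery transitions can be illustrated with a short Python sketch. The destination labels "A" and "B" are placeholders, since the actual destination states of state (k) are given only in Figure 6-4, and the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction

def recovery_probabilities(destinations):
    """Each listed recovery transition is equally likely; transitions that
    share a destination state are merged by adding their probabilities."""
    n = len(destinations)                  # number of transitions leaving the state
    counts = Counter(destinations)
    return {dest: Fraction(count, n) for dest, count in counts.items()}

# State (k): three recovery transitions, two of which reach the same state.
probs = recovery_probabilities(["A", "A", "B"])
print(probs["A"], probs["B"])   # 2/3 1/3
```

The merged transition carries twice the weight of a single transition, which is what the factor of 2 in the figure's labeling expresses.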

B. THE FIVE PROCESSOR CASE 

The process described in section A of this chapter can be extended to 
architectures that start with a larger number of spares. The algorithm that 
is used for selection of a spare to replace a failed component in the active 
set can be used without change. Only the five processor case will be 
considered in this paper. 

The addition of an extra spare in the architecture will increase the 
number of fault classes in the system state aggregation categories. The 
reasons for the increases have already been discussed. 

Figure 6-5 shows the set of impossible states that would be good states 
if the single fault arrival assumption were not applied to the active core 
components. This set of impossible fault classes cannot occur because the 
system will reconfigure to an improved fault condition for the active core 
by use of the recovery algorithm described in section A of this chapter. 

The set of possible good states is shown in Figure 6-6. Numbers are used 
to label the state aggregations for the five processor case because the 
increase in system complexity makes the use of letters inconvenient. 

A semi-Markov model can be constructed using the set of state 
aggregations in Figure 6-6. This model is shown in Figure 6-7. Because the 
model for the five processor case is more complex than for the three or four 
processor cases, the representation of the model requires more room to 
show all of the aggregation states and transitions. Therefore, Figure 6-7 is 
[...] Figure 6-6. 
In several cases, the position of the fault in a single fault class changes 
the state to which recovery is possible, as has been previously discussed. 
The five processor case contains quite a few more states than the four 
processor case. In one case the state aggregation was split into two 
separate states (50 and 51) in order to more clearly show the differences in 
recovery possibilities. 

Figure 6-5 (Continued). Impossible (Good) System States for the Five Processor SIR Architecture 

Figure 6-6. Good System States for the Five Processor SIR Architecture 

[...] for the Five Processor SIR Architecture 

VII. RESULTS AND CONCLUSIONS 


A. RESULTS 

The results of applying the SURE Semi-Markov analysis program to the 
reliability models developed in Chapters V and VI are shown in Figure 7-1. 
The scale used for the probability of system failure axis is logarithmic so 
that the results of each of the three models can be shown in the same 
figure. The graph was constructed by plotting 15 equidistant points for 
each model, varying the mission time from 1 to 15 hours. Cubic spline 
interpolation was used to estimate the shape of the curve. 

The graphs in the figure show the improvement in reliability achieved by 
the addition of spares to the basic three processor case of the SIR 
architecture. The mission time that is required by the FAA for reliability 
calculations is 10 hours. Only the 5 processor case meets the FAA 
requirement for a 10^-9 probability of system catastrophic failure for the 
10 hour mission time. (The exact numbers produced by the SURE program 
place the bounds on the failure rate to within 3.07550 x 10^[?] and 3.13041 x 
10^[?].) 

The model that is developed in this paper can be extended to cases of the 
SIR architecture that use more than 2 spares. The algorithm that searches 
for an appropriate spare (discussed in Chapter V) still applies for the case 
of the six processor architecture. As the number of spares increases, 
however, there are two factors that combine to limit the increase in 
reliability that can be achieved by the addition of more spares. The first 
factor is the increased time required to exhaustively search the system for 
an optimum configuration for a given set of faults. The reason that the 
search time becomes critical stems from the topological richness of the 
node interconnections for the SIR architecture, and the fact that only two 
processors are active in the search for a third good processor (and the links 
connecting it with the two active processors). The two active processors 
that comprise the active core for the search can be circulated through the 
set of nodes until a good spare is found that has links to both the active 
nodes. The circulating process is performed by deactivating one of the 
nodes in the original set of two, substituting the newly activated spare in 
its place, and proceeding in the search with the new set of two search nodes. 
This requires a complete communication of the system state to the newly 
activated node prior to deactivating the node selected for elimination from 
the active core. The newly activated node must also be tested to establish 
that it is fault free prior to deactivating the node in the original set of two. 
At some point, the time limit of the control problem will require that the 
search be stopped. Although the search process could be continued at the 
conclusion of the next control cycle, the introduction of the extra time 
invalidates the single fault arrival and reconfiguration assumption that has 
been made for the models developed in this paper. The model would then 
require a larger number of states; in fact, the additional states 
required would be the states listed as good but impossible by the single 
fault arrival and reconfiguration assumption. 
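
A rough feel for how the exhaustive search grows can be had by counting the links and the candidate three-processor cores in a system of n nodes. The sketch below assumes that every pair of nodes is joined by a link and that the active core always consists of three processors; both are simplifying assumptions made for illustration, and the function name is hypothetical.

```python
from math import comb

def search_space(n_nodes):
    """Return (links, candidate_cores) for a fully connected system.

    links           -- number of node pairs, n(n-1)/2
    candidate_cores -- number of distinct three-processor sets, C(n, 3)
    """
    return comb(n_nodes, 2), comb(n_nodes, 3)

for n in range(3, 7):
    links, cores = search_space(n)
    print(f"{n} nodes: {links} links, {cores} candidate cores")
# 3 nodes: 3 links, 1 candidate cores
# 4 nodes: 6 links, 4 candidate cores
# 5 nodes: 10 links, 10 candidate cores
# 6 nodes: 15 links, 20 candidate cores
```

Each candidate core that is actually tried also costs a state transfer and a test of the newly activated node, so the time consumed grows faster than the counts alone suggest.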

A second reason that increasing the number of spares fails to improve 
the reliability is that the complexity of the link components increases as 
nodes are added to the system. The models developed in this paper all used 
a standard link circuit that supports a 6 node system. There should actually 
be some improvement in the numbers that are used in the 3 and 4 
processor cases (the 5 processor case remains unchanged because the number 
of flip flops required does not change). For systems larger than 6 nodes, 
extra flip flops and AND gates are required at each end of the link. This 
will increase the failure rate of the link component, although the rate will 
still be less than that of the node. (The node failure rate is changed by the 
addition of [?] AND gates for each additional node to the system.) 

It should be noted that the reliability models that are developed in this 
paper are based on a visual analysis of a semi-Markov state representation 
of the arrival of faults and recovery transitions in the SIR system. The 
model becomes very complex as spares are added to the system. A 
conjecture was made in Chapter VII that significantly reduces the number 
of states that are necessary for the semi-Markov model to capture all 
nonredundant state information. There was not enough time remaining to 
completely prove the conjecture, but an assumption of its correctness 
allowed a visual analysis of the five processor case of the SIR architecture 
(2 spares). Without use of this conjecture, the number of states necessary 
in order to completely specify all state information would make a visual 
analysis unworkable. 

The conversion of the SURE program code to operate on the IBM PC-AT 
was not completed in time for use in this paper. The calculations 
graphically displayed in Figure 7-1 were performed with a version of SURE 
that runs on a VAX 11/780 minicomputer under the VMS operating system. 
The graphics package of the system is not operable at NPS so the numbers 
that were produced by SURE for the three SIR reliability models were 
plotted using DISSPLA, a collection of Fortran plotting routines installed on 
the IBM 3033 mainframe computer resident at NPS. 


B. RECOMMENDATIONS 

The conjecture presented in Chapter VII should be proven. The formal 
statement of the conjecture made in Chapter VII should be of assistance in 
the proof. The state aggregations that are possible by use of the conjecture 
still result in a reliability model that grows exponentially in complexity 
as spares are added to the SIR system. Visual analysis would be of limited 
benefit for systems containing more than 2 spares. A computer algorithm 
should be constructed to automate the analysis procedures that are 
graphically displayed for the five processor case of the SIR architecture. 
Such an automated system for applying the results of the conjecture would, 
in conjunction with the SURE semi-Markov analysis program, provide a very 
powerful tool for the analysis of very complex systems. 

There are several changes that can be made to the architecture of the SIR 
system that could improve the system failure rates predicted by the model 
presented in this paper. Contributions to component reliabilities made by 
component subsystems indicate that a prime target for improvement is the 
memory that is used in the node's computer. The failure rate of this 
subsystem is an order of magnitude higher than for the remaining node 
components. The use of such techniques as Hamming codes in the memory 
element should be investigated. This technique of course requires that more 
bits be included in the memory word, which leads to greater complexity for 
the memory. The Hamming codes allow a particular memory word to 
operate acceptably after the arrival of a fault to the circuitry that 
contains that word. This means that the memory is fault tolerant for a 
whole class of faults (1 fault per word for a Hamming code that detects 2 
faults and corrects 1 fault). There are obviously trade-offs that have to be 
analyzed. 
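
As an illustration of the kind of code referred to above, the following Python sketch implements an extended Hamming code that corrects 1 fault and detects 2 faults per word. The 4-bit data word, the bit layout, and the function names are assumptions chosen to keep the example small; they are not taken from the SIR node design.

```python
def encode(data_bits):
    """Encode 4 data bits into an 8-bit word (index 0 holds overall parity)."""
    d3, d5, d6, d7 = data_bits
    p1 = d3 ^ d5 ^ d7            # covers positions 1, 3, 5, 7
    p2 = d3 ^ d6 ^ d7            # covers positions 2, 3, 6, 7
    p4 = d5 ^ d6 ^ d7            # covers positions 4, 5, 6, 7
    word = [0, p1, p2, d3, p4, d5, d6, d7]
    word[0] = sum(word) % 2      # make the parity of the whole word even
    return word

def decode(word):
    """Return (data_bits, status); status is 'ok', 'corrected' or 'double'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):      # XOR of the positions that hold a 1
        if word[pos]:
            syndrome ^= pos
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        word[syndrome] ^= 1      # single fault; syndrome 0 is the parity bit
        status = "corrected"
    else:
        return None, "double"    # two faults: detected but not correctable
    return [word[3], word[5], word[6], word[7]], status

stored = encode([1, 0, 1, 1])
stored[5] ^= 1                   # one fault arrives in the stored word
print(decode(stored))            # ([1, 0, 1, 1], 'corrected')
```

The example stores 8 bits for every 4 data bits, which shows the cost side of the trade-off: the extra bits and the checking logic themselves add components that can fail.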

Another place where the node could be changed is in the design of the 
interface between the SIR node's microprocessor and the node's interstage. 
The custom slave processor mode of operation was selected to control the 
interstage and the communications protocol operating between the 
microprocessor and the interstage. This required a relatively complex 
controller element for decoding the commands sent by the microprocessor. 
The ROM used in the designed interstage controller contributed significantly 
to the failure rate of the interstage subcomponent of the SIR node. A 
peripheral interface between the microprocessor and the interstage should 
be explored as a possible reduction of required complexity in the interstage 
controller. 

Finally, gains could also be made in the manner in which the internode 
communications are performed in the SIR network. The design using 
separate clocks and a 5 register interstage could be reduced by a more 
elaborate protocol between the nodes. A system of flags could be managed 
in software resident in the host microprocessor, and the SIR node's 
interprocessor communication could share the microprocessor ports that 
are reserved exclusively for external communication in the present design. 
These approaches would result in a slower exchange of information between 
the SIR nodes, but the decrease in hardware complexity may result in 
increased reliability for the system. (Changes in the SIR node 
intercommunication hardware could require a modification to the model 
developed in this paper.) 